package syndic


Syndic.Atom: RFC 4287 compliant Atom parser.

module Error : module type of Syndic_error

The common signature that all error modules must (at least) satisfy.

Structure of Atom document

type text_construct =
  | Text of string
      (* Text(content) *)
  | Html of Uri.t option * string
      (* Html(xmlbase, content) where the content is left unparsed. *)
  | Xhtml of Uri.t option * Syndic_xml.t list
      (* Xhtml(xmlbase, content) *)

A text construct. It contains human-readable text, usually in small quantities. The content of Text constructs is Language-Sensitive.

Since the constructors Text, Html and Xhtml are shadowed by those of the same name in the definition of content, you may need a type annotation to disambiguate the two.
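As a small sketch of the disambiguation described above (assuming the library is linked and reachable as Syndic.Atom; the value and function names are only illustrative), a type annotation selects which of the two Text constructors is meant:

```ocaml
(* Sketch: choosing between the two [Text] constructors of Syndic.Atom. *)
module A = Syndic.Atom

(* The annotation makes [Text] refer to the text_construct constructor. *)
let as_title : A.text_construct = A.Text "Release notes"

(* Same constructor name, but here it builds a value of type [content]. *)
let as_content : A.content = A.Text "Release notes"

(* The annotated argument type disambiguates the patterns as well.
   Returns the string carried by Text and Html, and None for Xhtml. *)
let unparsed_string (t : A.text_construct) : string option =
  match t with
  | A.Text s -> Some s
  | A.Html (_, s) -> Some s
  | A.Xhtml _ -> None
```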

type author = {
  name : string;
  uri : Uri.t option;
  email : string option;
}

Describes a person, corporation, or similar entity (hereafter, 'person') that indicates the author of the entry or feed. See RFC 4287 § 3.2. Person constructs allow extension Metadata elements (see Section 6.4).

They are used for authors (see RFC 4287 § 4.2.1) and contributors (see RFC 4287 § 4.2.3).

val author : ?uri:Uri.t -> ?email:string -> string -> author
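A minimal sketch of how the author constructor might be called, based only on the signature above. The names and addresses are made up for illustration.

```ocaml
module A = Syndic.Atom

(* Only the name is mandatory; uri and email are optional labelled arguments. *)
let anonymous : A.author = A.author "Anonymous"

(* Hypothetical person with both optional fields filled in. *)
let alice : A.author =
  A.author
    ~uri:(Uri.of_string "https://example.org/~alice")
    ~email:"alice@example.org"
    "Alice"
```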
type category = {
  term : string;
  scheme : Uri.t option;
  label : string option;
}

The category element conveys information about a category associated with an entry or feed. This specification assigns no meaning to the content (if any) of this element. See RFC 4287 § 4.2.2.

  • term is a string that identifies the category to which the entry or feed belongs. See RFC 4287 § 4.2.2.1
  • scheme, if present, is an IRI that identifies a categorization scheme. See RFC 4287 § 4.2.2.2
  • label, if present, is a human-readable label for display in end-user applications. The content of the "label" attribute is Language-Sensitive. See RFC 4287 § 4.2.2.3
val category : ?scheme:Uri.t -> ?label:string -> string -> category
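For illustration, a category could be built as follows. This is a sketch based on the signature above; the scheme IRI and the label are invented values.

```ocaml
module A = Syndic.Atom

(* The term is the positional argument; scheme and label are optional. *)
let ocaml_category : A.category =
  A.category
    ~scheme:(Uri.of_string "https://example.org/categories")
    ~label:"OCaml"
    "ocaml"
```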
type generator = {
  version : string option;
  uri : Uri.t option;
  content : string;
}

The generator element's content identifies the agent used to generate a feed, for debugging and other purposes.

  • content is a human-readable name for the generating agent.
  • uri, if present, SHOULD produce a representation that is relevant to that agent.
  • version, if present, indicates the version of the generating agent.

See RFC 4287 § 4.2.4.

val generator : ?uri:Uri.t -> ?version:string -> string -> generator
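A short sketch of the generator constructor in use, following the signature above. The agent name, version and URI are placeholders.

```ocaml
module A = Syndic.Atom

(* The positional argument becomes the human-readable [content] field. *)
let gen : A.generator =
  A.generator
    ~version:"1.0"
    ~uri:(Uri.of_string "https://example.org/generator")
    "Example Generator"
```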
type icon = Uri.t

The icon element's content is an IRI reference RFC3987 that identifies an image that provides iconic visual identification for a feed.

The image SHOULD have an aspect ratio of one (horizontal) to one (vertical) and SHOULD be suitable for presentation at a small size.

See RFC 4287 § 4.2.5

type id = Uri.t

The id element conveys a permanent, universally unique identifier for an entry or feed.

Its content MUST be an IRI, as defined by RFC3987. Note that the definition of "IRI" excludes relative references. Though the IRI might use a dereferencable scheme, Atom Processors MUST NOT assume it can be dereferenced.

The RFC states further requirements on identifiers, but they are not needed here or, at least, cannot be checked here.

See RFC 4287 § 4.2.6

type rel =
  | Alternate
      (* Signifies that the URI in the value of the link href field identifies an alternate version of the resource described by the containing element. *)
  | Related
      (* Signifies that the URI in the value of the link href field identifies a resource related to the resource described by the containing element. *)
  | Self
      (* Signifies that the URI in the value of the link href field identifies a resource equivalent to the containing element. *)
  | Enclosure
      (* Signifies that the IRI in the value of the link href field identifies a related resource that is potentially large in size and might require special handling. When Enclosure is specified, the length attribute SHOULD be provided. *)
  | Via
      (* Signifies that the IRI in the value of the link href field identifies a resource that is the source of the information provided in the containing element. *)

Indicates the link relation type. See RFC 4287 § 4.2.7.2.

link defines a reference from an entry or feed to a Web resource. See RFC 4287 § 4.2.7.

  • href contains the link's IRI. The value MUST be an IRI reference, RFC3987. See RFC 4287 § 4.2.7.1.
  • type_media is an advisory media type: it is a hint about the type of the representation that is expected to be returned when the value of the href attribute is dereferenced. Note that the type attribute does not override the actual media type returned with the representation. The value of type_media, if given, MUST conform to the syntax of a MIME media type, MIMEREG. See RFC 4287 § 4.2.7.3.
  • hreflang describes the language of the resource pointed to by the href attribute. When used together with rel = Alternate, it implies a translated version of the entry. The value of hreflang MUST be a language tag, RFC3066. See RFC 4287 § 4.2.7.4.
  • title conveys human-readable information about the link. The content of the "title" attribute is Language-Sensitive. The value "" means that no title is provided. See RFC 4287 § 4.2.7.5.
  • length indicates an advisory length of the linked content in octets; it is a hint about the content length of the representation returned when the IRI in the href attribute is mapped to a URI and dereferenced. Note that the length attribute does not override the actual content length of the representation as reported by the underlying protocol. See RFC 4287 § 4.2.7.6.

link uri creates a link element.

  • parameter rel

    The rel attribute of the link. It defaults to Alternate since RFC 4287 § 4.2.7.2 says that if the "rel" attribute is not present, the link element MUST be interpreted as if the link relation type is "alternate".

    The other optional arguments all default to None (i.e., not specified).

logo is an IRI reference RFC3987 that identifies an image that provides visual identification for a feed.

The image SHOULD have an aspect ratio of 2 (horizontal) to 1 (vertical).

See RFC 4287 § 4.2.8

type published = Syndic_date.t

published is a Date construct indicating an instant in time associated with an event early in the life cycle of the entry.

Typically, published will be associated with the initial creation or first availability of the resource.

See RFC 4287 § 4.2.9

type rights = text_construct

rights is a Text construct that conveys information about rights held in and over an entry or feed. The rights element SHOULD NOT be used to convey machine-readable licensing information.

If an atom:entry element does not contain an atom:rights element, then the atom:rights element of the containing atom:feed element, if present, is considered to apply to the entry.

See RFC 4287 § 4.2.10

type title = text_construct

title is a Text construct that conveys a human-readable title for an entry or feed. See RFC 4287 § 4.2.14

type subtitle = text_construct

subtitle is a Text construct that conveys a human-readable description or subtitle for a feed. See RFC 4287 § 4.2.12

type updated = Syndic_date.t

updated is a Date construct indicating the most recent instant in time when an entry or feed was modified in a way the publisher considers significant. Therefore, not all modifications necessarily result in a changed updated value.

Publishers MAY change the value of this element over time.

See RFC 4287 § 4.2.15

type source = {
  authors : author list;
  categories : category list;
  contributors : author list;
  generator : generator option;
  icon : icon option;
  id : id;
  rights : rights option;
  subtitle : subtitle option;
  title : title;
  updated : updated option;
}

If an entry is copied from one feed into another feed, then the source feed's metadata (all child elements of atom:feed other than the atom:entry elements) MAY be preserved within the copied entry by adding an atom:source child element, if it is not already present in the entry, and including some or all of the source feed's Metadata elements as the atom:source element's children. Such metadata SHOULD be preserved if the source atom:feed contains any of the child elements atom:author, atom:contributor, atom:rights, or atom:category and those child elements are not present in the source atom:entry.

See RFC 4287 § 4.2.11

The atom:source element is designed to allow the aggregation of entries from different feeds while retaining information about an entry's source feed. For this reason, Atom Processors that are performing such aggregation SHOULD include at least the required feed-level Metadata fields (id, title, and updated) in the source element.

See RFC 4287 § 4.1.2 for more details.

val source : ?categories:category list -> ?contributors:author list -> ?generator:generator -> ?icon:icon -> ?links:link list -> ?logo:logo -> ?rights:rights -> ?subtitle:subtitle -> ?updated:updated -> authors:author list -> id:id -> title:title -> source
type mime = string

A MIME type that conforms to the syntax of a MIME media type, but MUST NOT be a composite type (see Section 4.2.6 of MIMEREG).

See RFC 4287 § 4.1.3.1

type content =
  | Text of string
  | Html of Uri.t option * string
  | Xhtml of Uri.t option * Syndic_xml.t list
  | Mime of mime * string
  | Src of mime option * Uri.t

content either contains or links to the content of the entry. The value of content is Language-Sensitive. See RFC 4287 § 4.1.3

  • Text, Html, Xhtml or Mime means that the content was part of the document and is provided as an argument. The first argument to Html and Xhtml is the possible xml:base value. See RFC 4287 § 3.1.1
  • Src(m, iri) means that the content is to be found at iri and has MIME type m. Atom Processors MAY use the IRI to retrieve the content and MAY choose to ignore remote content or to present it in a different manner than local content. The value of m is advisory; that is to say, when the corresponding URI (mapped from an IRI, if necessary) is dereferenced, if the server providing that content also provides a media type, the server-provided media type is authoritative. See RFC 4287 § 4.1.3.2
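The distinction between inline and remote content can be sketched with a pattern match over the constructors listed above. The describe function and the sample URL are hypothetical and only serve as an illustration.

```ocaml
module A = Syndic.Atom

(* Returns a short description of where the content lives and what it is. *)
let describe (c : A.content) : string =
  match c with
  | A.Text _ -> "inline plain text"
  | A.Html _ -> "inline HTML (unparsed)"
  | A.Xhtml _ -> "inline XHTML tree"
  | A.Mime (m, _) -> "inline data of type " ^ m
  | A.Src (Some m, iri) -> "remote " ^ m ^ " at " ^ Uri.to_string iri
  | A.Src (None, iri) -> "remote content at " ^ Uri.to_string iri

(* Remote content: the MIME type given here is only advisory. *)
let remote : A.content =
  A.Src (Some "video/mp4", Uri.of_string "https://example.org/talk.mp4")
```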
type summary = text_construct

summary is a Text construct that conveys a short summary, abstract, or excerpt of an entry.

It is not advisable for summary to duplicate title or content because Atom Processors might assume there is a useful summary when there is none.

See RFC 4287 § 4.2.13

type entry = {
  authors : author * author list;
  categories : category list;
  content : content option;
  contributors : author list;
  id : id;
  published : published option;
  rights : rights option;
  source : source option;
  summary : summary option;
  title : title;
  updated : updated;
}

entry represents an individual entry, acting as a container for metadata and data associated with the entry. This element can appear as a child of the atom:feed element, or it can appear as the document (i.e., top-level) element of a stand-alone Atom Entry Document.

The specification mandates that each entry contains an author unless it contains some sources or the feed contains an author element. This library ensures that the authors are properly dispatched to all locations.

The following child elements are defined by this specification (note that it requires the presence of some of these elements):

  • if content = None, then links MUST contain at least one element with a rel attribute value of Alternate.
  • There MUST NOT be more than one element of links with a rel attribute value of Alternate that has the same combination of type and hreflang attribute values.
  • There MAY be additional elements of links beyond those described above.
  • There MUST be a summary in either of the following cases:

    • the atom:entry contains an atom:content that has a "src" attribute (and is thus empty).
    • the atom:entry contains content that is encoded in Base64; i.e., the "type" attribute of atom:content is a MIME media type MIMEREG, but is not an XML media type RFC3023, does not begin with "text/", and does not end with "/xml" or "+xml".

See RFC 4287 § 4.1.2
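The rule about mandatory summaries can be approximated in code. The sketch below is not part of the library: summary_required and its helpers are hypothetical, and the RFC3023 test for "XML media type" is approximated by the "/xml" and "+xml" suffix checks only.

```ocaml
module A = Syndic.Atom

(* Small string helpers, written out to avoid depending on a recent stdlib. *)
let has_prefix ~prefix s =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

let has_suffix ~suffix s =
  let ls = String.length s and lx = String.length suffix in
  ls >= lx && String.sub s (ls - lx) lx = suffix

(* [true] when, under the approximation stated above, the entry must carry
   a summary: remote content, or inline data that would be Base64-encoded. *)
let summary_required (c : A.content option) : bool =
  match c with
  | Some (A.Src _) -> true
  | Some (A.Mime (m, _)) ->
    let m = String.lowercase_ascii m in
    not (has_prefix ~prefix:"text/" m
         || has_suffix ~suffix:"/xml" m
         || has_suffix ~suffix:"+xml" m)
  | Some (A.Text _ | A.Html _ | A.Xhtml _) | None -> false
```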

val entry : ?categories:category list -> ?content:content -> ?contributors:author list -> ?links:link list -> ?published:published -> ?rights:rights -> ?source:source -> ?summary:summary -> id:id -> authors:(author * author list) -> title:title -> updated:updated -> unit -> entry
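A possible way to build an entry with this constructor is sketched below. The function name, identifier and texts are invented, and the date is taken as a parameter so that no particular way of creating a Syndic_date.t value is assumed.

```ocaml
module A = Syndic.Atom

(* Builds a small entry; [updated] is supplied by the caller. *)
let hello_entry ~(updated : A.updated) : A.entry =
  (* Annotations select the text_construct constructors (see text_construct). *)
  let title : A.title = A.Text "Hello, Atom" in
  let summary : A.summary = A.Text "A first post." in
  A.entry
    ~id:(Uri.of_string "urn:example:entry:1")
    ~authors:(A.author "Alice", []) (* main author, then the other authors *)
    ~title ~summary ~updated
    ~content:(A.Html (None, "<p>Hello, <em>Atom</em>!</p>"))
    () (* the final unit marks the end of the optional arguments *)
```

The authors argument is a pair rather than a list, which makes it impossible to build an entry without at least one author.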
type feed = {
  authors : author list;
  categories : category list;
  contributors : author list;
  generator : generator option;
  icon : icon option;
  id : id;
  rights : rights option;
  subtitle : subtitle option;
  title : title;
  updated : updated;
  entries : entry list;
}

feed is the document (i.e., top-level) element of an Atom Feed Document, acting as a container for metadata and data associated with the feed. Its element children consist of metadata elements followed by zero or more atom:entry child elements.

  • one of the links SHOULD have a rel attribute value of Self. This is the preferred URI for retrieving Atom Feed Documents representing this Atom feed.
  • There MUST NOT be more than one element of links with a rel attribute value of Alternate that has the same combination of type and hreflang attribute values.
  • There may be additional elements in links beyond those described above.

If multiple entry elements with the same id value appear in an Atom Feed Document, they represent the same entry. Their updated timestamps SHOULD be different. If an Atom Feed Document contains multiple entries with the same id, Atom Processors MAY choose to display all of them or some subset of them. One typical behavior would be to display only the entry with the latest updated timestamp.

See RFC 4287 § 4.1.1

val feed : ?authors:author list -> ?categories:category list -> ?contributors:author list -> ?generator:generator -> ?icon:icon -> ?links:link list -> ?logo:logo -> ?rights:rights -> ?subtitle:subtitle -> id:id -> title:title -> updated:updated -> entry list -> feed
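Following the signature above, a feed might be assembled like this. It is a sketch with made-up identifiers; the entries and the date are passed in by the caller.

```ocaml
module A = Syndic.Atom

(* Wraps a list of entries into a feed with one author and a fixed title. *)
let small_feed ~(updated : A.updated) (entries : A.entry list) : A.feed =
  let title : A.title = A.Text "Example feed" in
  A.feed
    ~authors:[ A.author "Alice" ]
    ~id:(Uri.of_string "urn:example:feed")
    ~title ~updated
    entries (* the entry list is the final positional argument *)
```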

Input and output

val parse : ?self:Uri.t -> ?xmlbase:Uri.t -> Xmlm.input -> feed

parse xml returns the feed corresponding to xml. Beware that xml is mutable, so when the parsing fails, one has to create a new copy of xml to use it with another function. If you retrieve xml from a URL, you should use that URL as ~xmlbase.

Raise Error.Expected, Expected_Data or Error.Duplicate_Link if xml is not a valid Atom document.

  • parameter xmlbase

    default xml:base to resolve relative URLs (of course xml:base attributes in the XML Atom document take precedence over this). See XML Base.

  • parameter self

    the URI from where the current feed was retrieved. Giving this information will add an entry to links with rel = Self unless one already exists.
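As an illustration of parse, the sketch below feeds it an in-memory document through Xmlm.make_input. The sample is the brief example feed from RFC 4287 § 1.1; parse_string is a hypothetical helper, and catching every exception is a simplification chosen so that no specific exception constructor has to be assumed.

```ocaml
module A = Syndic.Atom

(* The single-entry example feed from RFC 4287, section 1.1. *)
let sample = {xml|<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <link href="http://example.org/"/>
  <updated>2003-12-13T18:30:02Z</updated>
  <author><name>John Doe</name></author>
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
  <entry>
    <title>Atom-Powered Robots Run Amok</title>
    <link href="http://example.org/2003/12/13/atom03"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2003-12-13T18:30:02Z</updated>
    <summary>Some text.</summary>
  </entry>
</feed>|xml}

(* Parses a string, turning any exception into an error message. *)
let parse_string ?xmlbase (s : string) : (A.feed, string) result =
  (* A fresh input is created on every call because Xmlm inputs are mutable. *)
  let input = Xmlm.make_input (`String (0, s)) in
  match A.parse ?xmlbase input with
  | feed -> Ok feed
  | exception e -> Error (Printexc.to_string e)
```

A caller that fetched the document over HTTP would pass the URL it was fetched from as ~xmlbase, as recommended above.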

val read : ?self:Uri.t -> ?xmlbase:Uri.t -> string -> feed

read fname reads the file named fname and parses it. For the optional parameters, see parse.

val to_xml : feed -> Syndic_xml.t

to_xml f converts the feed f to an XML tree.

val output : feed -> Xmlm.dest -> unit

output f dest writes the XML tree of the feed f to dest.

val write : feed -> string -> unit

write f fname writes the XML tree of the feed f to the file named fname.
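To serialize a feed to a string rather than a file, output can be pointed at a buffer, since Xmlm.dest accepts a `Buffer destination. The helper name feed_to_string is an assumption made for this sketch.

```ocaml
module A = Syndic.Atom

(* Serializes the feed into a string through an in-memory buffer. *)
let feed_to_string (f : A.feed) : string =
  let buf = Buffer.create 1024 in
  A.output f (`Buffer buf);
  Buffer.contents buf
```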

Convenience functions

val ascending : entry -> entry -> int

Compare entries so that older dates are smaller. The date of the entry is taken from the published field, if available, or otherwise updated is used.

val descending : entry -> entry -> int

Compare entries so that more recent dates are smaller. The date of the entry is taken from the published field, if available, or otherwise updated is used.
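Both comparison functions have the shape expected by List.sort, so sorting might look like the sketch below. The helper names are illustrative only.

```ocaml
module A = Syndic.Atom

(* Returns the entries with the oldest one first. *)
let oldest_first (entries : A.entry list) : A.entry list =
  List.sort A.ascending entries

(* Returns a copy of the feed whose entries start with the most recent one. *)
let newest_first (f : A.feed) : A.feed =
  { f with A.entries = List.sort A.descending f.A.entries }
```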

val aggregate : ?self:Uri.t -> ?id:id -> ?updated:updated -> ?subtitle:subtitle -> ?title:text_construct -> ?sort:[ `Newest_first | `Oldest_first | `None ] -> ?n:int -> feed list -> feed

aggregate feeds returns a single feed containing all the posts in feeds. In order to track the origin of each post in the aggregated feed, it is recommended that each feed in feeds possesses a link with rel = Self so that the source added to each entry contains a link to the original feed. If an entry already contains a source, it will not be overwritten.

  • parameter self

    The preferred URI for retrieving this aggregated Atom Feed. While not mandatory, it is good practice to set this.

  • parameter id

    the universally unique identifier for the aggregated feed. If it is not provided, a URN is built from the feeds' IDs.

  • parameter sort

    whether to sort the entries of the final feed. The default is `Newest_first because it is generally desired.

  • parameter n

    number of entries of the (sorted) aggregated feed to return.
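A "planet"-style aggregation could be sketched as follows, using only the parameters documented above. The function name, title, self URI and entry limit are assumptions chosen for the example.

```ocaml
module A = Syndic.Atom

(* Merges several feeds and keeps the 20 most recent entries. *)
let planet (feeds : A.feed list) : A.feed =
  let title : A.text_construct = A.Text "Example planet" in
  A.aggregate
    ~self:(Uri.of_string "https://planet.example.org/atom.xml")
    ~title
    ~sort:`Newest_first
    ~n:20
    feeds
```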

set_self feed url adds or replaces the URI in the self link of the feed. You can also set the hreflang and length of the self link.

get_self feed returns the self link of the feed, if any is present.

val set_main_author : feed -> author -> feed

set_main_author feed author adds author to the front of the list of authors of the feed (if an author with the same name already exists, the optional information is merged, the values in author taking precedence). It also removes all empty authors (name = "" and no URI, no email) and replaces them with author if no author is left and an author is mandatory.
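For instance, a feed could be given a main author as sketched below. with_owner and the author details are hypothetical.

```ocaml
module A = Syndic.Atom

(* Returns the feed with "Alice" placed in front of its list of authors. *)
let with_owner (f : A.feed) : A.feed =
  A.set_main_author f (A.author ~email:"alice@example.org" "Alice")
```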