Transform between iterable of iterables and a multilabel format
Although a list of sets or tuples is a very intuitive format for multilabel data, it is unwieldy to process. This transformer converts between this intuitive format and the supported multilabel format: a (samples x classes) binary matrix indicating the presence of a class label.
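
As a rough illustration of the target format, the sketch below builds the (samples x classes) indicator matrix in plain Python. The helper name `to_indicator_matrix` is hypothetical and this is not how the transformer is implemented; it only shows what each row and column mean.

```python
def to_indicator_matrix(samples, classes=None):
    """Hypothetical helper: build a (samples x classes) 0/1 matrix as nested lists."""
    if classes is None:
        # Without an explicit ordering, use the sorted set of all labels seen.
        classes = sorted({label for sample in samples for label in sample})
    column = {label: j for j, label in enumerate(classes)}
    matrix = []
    for sample in samples:
        row = [0] * len(classes)
        for label in sample:
            row[column[label]] = 1  # mark the label as present in this sample
        matrix.append(row)
    return matrix, list(classes)


matrix, classes = to_indicator_matrix([(1, 2), (3,)])
print(matrix)   # [[1, 1, 0], [0, 0, 1]]
print(classes)  # [1, 2, 3]
```

Each row corresponds to one sample and each column to one class, so a 1 at position (i, j) means sample i carries class j.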

Parameters
----------
classes : array-like of shape (n_classes,), optional
    Indicates an ordering for the class labels.
    All entries should be unique (cannot contain duplicate classes).

sparse_output : bool, default=False
    Set to True if the output binary array is desired in CSR sparse format.

Attributes
----------
classes_ : array of labels
    A copy of the `classes` parameter where provided, or otherwise, the
    sorted set of classes found when fitting.

Examples
--------
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> mlb = MultiLabelBinarizer()
>>> mlb.fit_transform([(1, 2), (3,)])
array([[1, 1, 0],
       [0, 0, 1]])
>>> mlb.classes_
array([1, 2, 3])

>>> mlb.fit_transform([{'sci-fi', 'thriller'}, {'comedy'}])
array([[0, 1, 1],
       [1, 0, 0]])
>>> list(mlb.classes_)
['comedy', 'sci-fi', 'thriller']
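
The `classes` and `sparse_output` parameters can be combined, and the conversion also works in the other direction. The following is a minimal sketch, assuming scikit-learn is installed; it uses `inverse_transform`, a method of the same transformer that is not otherwise described in this docstring, and the variable names are illustrative only.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Fix the column order explicitly and request CSR sparse output.
mlb = MultiLabelBinarizer(classes=['thriller', 'sci-fi', 'comedy'],
                          sparse_output=True)
Y = mlb.fit_transform([{'sci-fi', 'thriller'}, {'comedy'}])

# Densify only for display; columns follow the order given in `classes`.
dense = Y.toarray()
print(dense)

# inverse_transform maps the indicator rows back to tuples of labels.
recovered = mlb.inverse_transform(Y)
print(recovered)
```

Because the column order is taken from `classes` rather than from sorting, the first column here stands for 'thriller' instead of 'comedy'.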
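
A common mistake is to pass in a flat list of labels instead of a list of label collections. Each string is then treated as a sample whose labels are its individual characters: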

>>> mlb = MultiLabelBinarizer()
>>> mlb.fit(['sci-fi', 'thriller', 'comedy'])
MultiLabelBinarizer()
>>> mlb.classes_
array(['-', 'c', 'd', 'e', 'f', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't',
    'y'], dtype=object)

To correct this, the list of labels should be passed in as:

>>> mlb = MultiLabelBinarizer()
>>> mlb.fit([['sci-fi', 'thriller', 'comedy']])
MultiLabelBinarizer()
>>> mlb.classes_
array(['comedy', 'sci-fi', 'thriller'], dtype=object)
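
The difference between the two calls above comes down to how Python iterates over strings. The sketch below, in plain Python and assuming the classes are gathered by iterating over every sample, reproduces both results; `collect_classes` is a hypothetical stand-in, not the library's code.

```python
from itertools import chain


def collect_classes(samples):
    """Hypothetical stand-in: sorted set of every item yielded by every sample."""
    return sorted(set(chain.from_iterable(samples)))


flat = ['sci-fi', 'thriller', 'comedy']

# Each string is itself an iterable, so its characters become the "labels".
chars = collect_classes(flat)
print(chars)

# Wrapping the list makes it a single sample carrying three labels.
labels = collect_classes([flat])
print(labels)  # ['comedy', 'sci-fi', 'thriller']
```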

See also
--------
sklearn.preprocessing.OneHotEncoder : Encode categorical features
    using a one-hot aka one-of-K scheme.