
class Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)


Defines a datatype together with instructions for converting to Tensor.

The Field class models common text processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method and the kind of Tensor that should be produced.

If a Field is shared between two columns in a dataset (e.g., question and answer in a QA dataset), then they will have a shared vocabulary.
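
To illustrate what a shared vocabulary means, here is a minimal plain-Python sketch. It does not use torchtext: `build_shared_vocab` and `numericalize` are hypothetical helpers written for this example, and the index layout (special tokens first, then tokens by descending frequency) is an assumption rather than a description of how Vocab is implemented.

```python
from collections import Counter

def build_shared_vocab(columns, specials=("<unk>", "<pad>")):
    """Build one token-to-index map from several tokenized columns."""
    counts = Counter(
        tok for column in columns for example in column for tok in example
    )
    # Specials come first, then tokens by descending frequency
    # (ties broken alphabetically so the result is deterministic).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    return {tok: idx for idx, tok in enumerate(itos)}

def numericalize(tokens, stoi, unk_token="<unk>"):
    """Map tokens to indices, sending unseen tokens to the unknown index."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]

# Two columns (question and answer) that would share one field.
questions = [["who", "wrote", "hamlet"]]
answers = [["shakespeare", "wrote", "hamlet"]]

stoi = build_shared_vocab([questions, answers])
q_ids = numericalize(questions[0], stoi)
a_ids = numericalize(answers[0], stoi)
```

Because both columns are numericalized through the same mapping, a token such as "hamlet" receives the same index whether it appears in a question or in an answer.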


  • sequential – Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
  • use_vocab – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
  • init_token – A token that will be prepended to every example using this field, or None for no initial token. Default: None.
  • eos_token – A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
  • fix_length – A fixed length that all examples using this field will be padded to, or None for flexible sequence lengths. Default: None.
  • dtype – The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
  • preprocessing – The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
  • postprocessing – A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list, and the field’s Vocab. Default: None.
  • lower – Whether to lowercase the text in this field. Default: False.
  • tokenize – The function used to tokenize strings using this field into sequential examples. If “spacy”, the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.
  • tokenizer_language – The language of the tokenizer to be constructed. Languages other than English are currently supported only by the SpaCy tokenizer. Default: “en”.
  • include_lengths – Whether to return a tuple of a padded minibatch and a list containing the length of each example, or just a padded minibatch. Default: False.
  • batch_first – Whether to produce tensors with the batch dimension first. Default: False.
  • pad_token – The string token used as padding. Default: “<pad>”.
  • unk_token – The string token used to represent OOV words. Default: “<unk>”.
  • pad_first – Do the padding of the sequence at the beginning. Default: False.
  • truncate_first – Do the truncating of the sequence at the beginning. Default: False.
  • stop_words – Tokens to discard during the preprocessing step. Default: None.
  • is_target – Whether this field is a target variable. Affects iteration over batches. Default: False.
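
The interaction between init_token, eos_token, fix_length, pad_first, truncate_first and include_lengths can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: `pad_batch` is a hypothetical helper, and it assumes that fix_length counts the initial and end-of-sentence tokens as part of the fixed length.

```python
def pad_batch(batch, fix_length=None, init_token=None, eos_token=None,
              pad_token="<pad>", pad_first=False, truncate_first=False,
              include_lengths=False):
    """Pad (and, if needed, truncate) a batch of token lists to one length."""
    batch = [list(x) for x in batch]
    prefix = [] if init_token is None else [init_token]
    suffix = [] if eos_token is None else [eos_token]
    if fix_length is None:
        # Flexible length: pad up to the longest example in the batch.
        max_len = max(len(x) for x in batch)
    else:
        # Assumption: the fixed length includes the init/eos tokens.
        max_len = max(0, fix_length - len(prefix) - len(suffix))
    padded, lengths = [], []
    for x in batch:
        if truncate_first:
            body = x[max(0, len(x) - max_len):]  # keep the last max_len tokens
        else:
            body = x[:max_len]                   # keep the first max_len tokens
        seq = prefix + body + suffix
        fill = [pad_token] * (max_len - len(body))
        padded.append(fill + seq if pad_first else seq + fill)
        lengths.append(len(seq))  # length without padding
    return (padded, lengths) if include_lengths else padded

batch = [["a", "b", "c", "d"], ["e"]]
padded, lengths = pad_batch(batch, fix_length=5, init_token="<s>",
                            eos_token="</s>", include_lengths=True)
front_padded = pad_batch([["a", "b", "c"], ["d"]], fix_length=2,
                         pad_first=True, truncate_first=True)
```

In the first call, the four-token example is truncated to three tokens so that it fits in five positions together with the two special tokens. The second call shows padding and truncation applied at the beginning of each sequence instead of the end.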

