Quantization Operators

Quantization is a model-optimization technique that reduces the size of a large model, lowering storage and bandwidth costs at the price of a small loss in accuracy.

CUDA Operators

at::Tensor _float_to_bfloat16_gpu(const at::Tensor &input)

Converts a tensor of float values into a tensor of Brain Floating Point (bfloat16) values.

Parameters:

input – A tensor of float values

Returns:

A new tensor with values from the input tensor converted to bfloat16.

at::Tensor _bfloat16_to_float_gpu(const at::Tensor &input)

Converts a tensor of Brain Floating Point (bfloat16) values into a tensor of float values.

Parameters:

input – A tensor of bfloat16 values

Returns:

A new tensor with values from the input tensor converted to float.
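The two kernels above operate elementwise on bit patterns: bfloat16 keeps float32's sign and 8 exponent bits but only the top 7 mantissa bits. A plain-Python sketch of that bit manipulation (the round-to-nearest-even step is an assumption and may not match the kernel's exact rounding):

```python
import struct

def float_to_bfloat16_bits(x: float) -> int:
    # Reinterpret float32 as uint32; keep the top 16 bits (sign, 8 exp, 7 mantissa).
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Round-to-nearest-even on the 16 bits being dropped.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF

def bfloat16_bits_to_float(b: int) -> float:
    # Widening back is exact: pad the dropped mantissa bits with zeros.
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x
```

Values whose mantissa fits in 7 bits (powers of two, 1.5, etc.) round-trip exactly; everything else lands on the nearest representable bfloat16.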

Tensor _float_to_FP8rowwise_gpu(const Tensor &input, const bool forward)

Converts a tensor of float values into a tensor of fp8 values.

Parameters:
  • input – A tensor of float values. The dtype can be either SparseType::FP32, SparseType::FP16, or SparseType::BF16

  • forward

Throws:

c10::Error – if input.dtype is not one of (SparseType::FP32, SparseType::FP16, or SparseType::BF16).

Returns:

A new tensor with values from the input tensor converted to fp8.

at::Tensor _FP8rowwise_to_float_gpu(const at::Tensor &input, bool forward, const int64_t output_dtype)

Converts a tensor of fp8 values into a tensor of float values.

Parameters:
  • input – A tensor of fp8 values

  • forward

  • output_dtype – The target floating point type, specified as integer representation of SparseType enum

Throws:

c10::Error – if output_dtype is not one of (SparseType::FP32, SparseType::FP16, or SparseType::BF16).

Returns:

A new tensor with values from the input tensor converted to float (with dtype of either SparseType::FP32, SparseType::FP16, or SparseType::BF16).
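Rowwise fp8 quantization picks one scale per row so the row's largest magnitude lands near the top of the fp8 range, then rounds each scaled value to the reduced mantissa. A simplified Python sketch of both steps (the 448.0 maximum assumes an e4m3-style format, and exponent clamping is ignored; neither is taken from this page):

```python
import math

def fp8_rowwise_scale(row, fp8_max=448.0):
    # One scale per row: map the largest magnitude to the fp8 maximum.
    amax = max(abs(x) for x in row) or 1.0
    return fp8_max / amax

def fake_quant_fp8(x, mbits=3):
    # Round x to a float with an mbits-bit mantissa ("fake quantization").
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)           # x = m * 2**e, with 0.5 <= |m| < 1
    step = 2.0 ** -(mbits + 1)     # mantissa resolution at this exponent
    return round(m / step) * step * 2.0 ** e
```

The relative error of `fake_quant_fp8` is bounded by half a mantissa step, which is the accuracy/size trade-off fp8 formats make.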

Tensor _float_to_fused8bitrowwise_gpu(const Tensor &input)

Converts a tensor of float values into a tensor of fused 8-bit rowwise values.

Parameters:

input – A tensor of float values

Returns:

A new tensor with values from the input tensor converted to fused 8-bit rowwise.

Tensor _half_to_fused8bitrowwise_gpu(const Tensor &input)

Converts a tensor of at::Half values into a tensor of fused 8-bit rowwise values.

Parameters:

input – A tensor of at::Half values

Returns:

A new tensor with values from the input tensor converted to fused 8-bit rowwise.

Tensor _single_or_half_precision_to_fused8bitrowwise_gpu(const Tensor &input)

Converts a tensor of float (single-precision) or at::Half values into a tensor of fused 8-bit rowwise values.

Parameters:

input – A tensor of float (single-precision) or at::Half values

Returns:

A new tensor with values from the input tensor converted to fused 8-bit rowwise.

at::Tensor _fused8bitrowwise_to_float_gpu(const at::Tensor &input)

Converts a tensor of fused 8-bit rowwise values into a tensor of float values.

Parameters:

input – A tensor of fused 8-bit rowwise values

Returns:

A new tensor with values from the input tensor converted to float.
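"Fused" in these operators means each quantized row carries its own scale and zero point appended to the row's bytes, so a row can be dequantized independently. A minimal Python reference of the per-row asymmetric scheme (the kernel's exact byte layout and how it stores the scale/bias are not reproduced here):

```python
def quantize_row_8bit(row):
    # Per-row asymmetric quantization: map [min, max] onto integer codes [0, 255].
    lo, hi = min(row), max(row)
    scale = (hi - lo) / 255.0 or 1.0   # avoid division by zero on constant rows
    q = [round((x - lo) / scale) for x in row]
    return q, scale, lo                # the fused layout appends scale/bias per row

def dequantize_row_8bit(q, scale, bias):
    return [v * scale + bias for v in q]
```

Round-tripping a row reconstructs each value to within half a quantization step (`scale / 2`).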

at::Tensor _fused8bitrowwise_to_half_gpu(const at::Tensor &input)

Converts a tensor of fused 8-bit rowwise values into a tensor of at::Half values.

Parameters:

input – A tensor of fused 8-bit rowwise values

Returns:

A new tensor with values from the input tensor converted to at::Half.

at::Tensor _fused8bitrowwise_to_single_or_half_precision_gpu(const at::Tensor &input, const int64_t output_dtype, const bool scale_bias_last, const bool quant_padding_float_type)

Converts a tensor of fused 8-bit rowwise values into a tensor of float, at::Half, or at::BFloat16 values.

Parameters:
  • input – A tensor of fused 8-bit rowwise values

  • output_dtype – The target floating point type, specified as integer representation of SparseType enum

  • scale_bias_last

  • quant_padding_float_type

Throws:

c10::Error – if output_dtype is not one of (SparseType::FP32, SparseType::FP16, or SparseType::BF16).

Returns:

A new tensor with values from the input tensor converted to float, at::Half, or at::BFloat16.

at::Tensor _fused8bitrowwise_to_float_mixed_dim_gpu(const at::Tensor &input, const at::Tensor &D_offsets, const int64_t output_dtype)

Converts a tensor of fused 8-bit rowwise values into a tensor of at::kFloat or at::kHalf values.

Parameters:
  • input – A tensor of fused 8-bit rowwise values

  • D_offsets

  • output_dtype – The target floating point type, specified as integer representation of SparseType enum

Throws:

c10::Error – if output_dtype is not one of (SparseType::FP32 or SparseType::FP16).

Returns:

A new tensor with values from the input tensor converted to at::kFloat or at::kHalf.

Tensor _float_to_fusednbitrowwise_gpu(const Tensor &input, const int64_t bit_rate)

Converts a tensor of float values into a tensor of fused N-bit rowwise values.

Parameters:
  • input – A tensor of float values

  • bit_rate

Returns:

A new tensor with values from the input tensor converted to fused N-bit rowwise.

at::Tensor _half_to_fusednbitrowwise_gpu(const at::Tensor &input, const int64_t bit_rate)

Converts a tensor of at::Half values into a tensor of fused N-bit rowwise values.

Parameters:
  • input – A tensor of at::Half values

  • bit_rate

Returns:

A new tensor with values from the input tensor converted to fused N-bit rowwise.

Tensor _single_or_half_precision_to_fusednbitrowwise_gpu(const Tensor &input, const int64_t bit_rate)

Converts a tensor of float or at::Half values into a tensor of fused N-bit rowwise values.

Parameters:
  • input – A tensor of float or at::Half values

  • bit_rate

Returns:

A new tensor with values from the input tensor converted to fused N-bit rowwise.
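For bit_rate = 4, two quantized codes share each output byte. A hypothetical packing sketch of that idea (the nibble order is an assumption, and the per-row scale/bias bytes that the fused layout also stores are omitted):

```python
def pack_4bit(codes):
    # Pack pairs of 4-bit codes (0..15) into bytes, low nibble first.
    out = []
    for i in range(0, len(codes), 2):
        lo = codes[i] & 0xF
        hi = (codes[i + 1] & 0xF) if i + 1 < len(codes) else 0  # zero-pad odd tails
        out.append(lo | (hi << 4))
    return bytes(out)

def unpack_4bit(data, n):
    # Recover n codes; the padding nibble on odd lengths is discarded.
    vals = []
    for b in data:
        vals.append(b & 0xF)
        vals.append(b >> 4)
    return vals[:n]
```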

at::Tensor _fusednbitrowwise_to_float_gpu(const at::Tensor &input, const int64_t bit_rate)

Converts a tensor of fused N-bit rowwise values into a tensor of float values.

Parameters:
  • input – A tensor of fused N-bit rowwise values

  • bit_rate

Returns:

A new tensor with values from the input tensor converted to float.

at::Tensor _fusednbitrowwise_to_half_gpu(const at::Tensor &input, const int64_t bit_rate)

Converts a tensor of fused N-bit rowwise values into a tensor of at::Half values.

Parameters:
  • input – A tensor of fused N-bit rowwise values

  • bit_rate

Returns:

A new tensor with values from the input tensor converted to at::Half.

at::Tensor _fusednbitrowwise_to_single_or_half_precision_gpu(const at::Tensor &input, const int64_t bit_rate, const int64_t output_dtype)

Converts a tensor of fused N-bit rowwise values into a tensor of float, at::Half, or at::BFloat16 values.

Parameters:
  • input – A tensor of fused N-bit rowwise values

  • bit_rate

  • output_dtype – The target floating point type, specified as integer representation of SparseType enum

Throws:

c10::Error – if output_dtype is not one of (SparseType::FP32, SparseType::FP16, or SparseType::BF16).

Returns:

A new tensor with values from the input tensor converted to float, at::Half, or at::BFloat16, depending on output_dtype.

at::Tensor _float_to_hfp8_gpu(const at::Tensor &input, const int64_t ebits, const int64_t exponent_bias, const double max_pos)

Converts a tensor of float values into a tensor of Hybrid 8-bit Floating Point (hfp8) values.

Parameters:
  • input – A tensor of float values

  • ebits

  • exponent_bias

  • max_pos

Throws:

c10::Error – if ebits <= 0 or exponent_bias <= 0.

Returns:

A new tensor with values from the input tensor converted to hfp8.

at::Tensor _hfp8_to_float_gpu(const at::Tensor &input, const int64_t ebits, const int64_t exponent_bias)

Converts a tensor of Hybrid 8-bit Floating Point (hfp8) values into a tensor of float values.

Parameters:
  • input – A tensor of hfp8 values

  • ebits

  • exponent_bias

Throws:

c10::Error – if ebits <= 0 or exponent_bias <= 0.

Returns:

A new tensor with values from the input tensor converted to float.
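An hfp8 byte packs a sign bit, ebits exponent bits, and (7 − ebits) mantissa bits, with a configurable exponent_bias. A simplified Python decoder illustrating that layout (subnormal and saturation details are glossed over and may not match the kernel exactly):

```python
def hfp8_to_float(byte, ebits, exponent_bias):
    # Layout (MSB to LSB): 1 sign bit | ebits exponent bits | (7 - ebits) mantissa bits.
    mbits = 7 - ebits
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> mbits) & ((1 << ebits) - 1)
    man = byte & ((1 << mbits) - 1)
    if exp == 0:
        # Subnormal: no implicit leading 1.
        return sign * (man / (1 << mbits)) * 2.0 ** (1 - exponent_bias)
    return sign * (1.0 + man / (1 << mbits)) * 2.0 ** (exp - exponent_bias)
```

With ebits = 4 and exponent_bias = 7, the byte `0b0_0111_000` decodes to 1.0 (exponent field equals the bias, mantissa zero).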

at::Tensor _float_to_msfp_gpu(const at::Tensor &input, const int64_t bounding_box_size, const int64_t ebits, const int64_t mbits, const int64_t bias, const double min_pos, const double max_pos)

Converts a tensor of float values into a tensor of Microsoft Floating Point (msfp) values.

Parameters:
  • input – A tensor of float values

  • bounding_box_size

  • ebits

  • mbits

  • bias

  • min_pos

  • max_pos

Returns:

A new tensor with values from the input tensor converted to msfp.

at::Tensor _msfp_to_float_gpu(const at::Tensor &input, const int64_t ebits, const int64_t mbits, const int64_t bias)

Converts a tensor of Microsoft Floating Point (msfp) values into a tensor of float values.

Parameters:
  • input – A tensor of msfp values

  • ebits

  • mbits

  • bias

Returns:

A new tensor with values from the input tensor converted to float.
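msfp quantizes a "bounding box" of bounding_box_size values under a single shared exponent taken from the block's largest magnitude, so only a short mantissa is stored per value. A sketch of that idea (mbits = 4 and round-to-nearest are assumptions, not the kernel's documented behavior):

```python
import math

def msfp_shared_exponent(block):
    # Shared exponent comes from the largest magnitude in the bounding box.
    amax = max(abs(x) for x in block)
    return math.frexp(amax)[1] if amax else 0

def msfp_quantize_block(block, mbits=4):
    e = msfp_shared_exponent(block)
    step = 2.0 ** (e - mbits)          # mantissa resolution under the shared exponent
    return [round(x / step) * step for x in block]
```

Values near the block maximum keep good relative precision; much smaller values in the same block lose precision, which is the msfp trade-off.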

Tensor _float_to_paddedFP8rowwise_gpu(const Tensor &input, const bool forward, const int64_t row_dim)

Converts a tensor of float values into a tensor of padded fp8 rowwise values.

Parameters:
  • input – A tensor of float values. The dtype can be either SparseType::FP32, SparseType::FP16, or SparseType::BF16

  • forward

  • row_dim

Returns:

A new tensor with values from the input tensor converted to padded fp8 rowwise.

at::Tensor _paddedFP8rowwise_to_float_gpu(const at::Tensor &input, const bool forward, const int64_t row_dim, const int64_t output_last_dim, const int64_t output_dtype)

Converts a tensor of padded fp8 rowwise values into a tensor of float values.

Parameters:
  • input – A tensor of padded fp8 rowwise values

  • forward

  • row_dim

  • output_last_dim

  • output_dtype – The target floating point type, specified as integer representation of SparseType enum

Throws:

c10::Error – if output_dtype is not one of (SparseType::FP32, SparseType::FP16, SparseType::BF16).

Returns:

A new tensor with values from the input tensor converted to float.

CPU Operators

Tensor &_fused8bitrowwise_to_float_cpu_out(Tensor &output, const Tensor &input)
Tensor &_float_to_fused8bitrowwise_cpu_out(Tensor &output, const Tensor &input)
Tensor float_to_fused8bitrowwise_cpu(const Tensor &input)
Tensor half_to_fused8bitrowwise_cpu(const Tensor &input)
Tensor float_or_half_to_fused8bitrowwise_cpu(const Tensor &input)
Tensor fused8bitrowwise_to_float_cpu(const Tensor &input)
Tensor fused8bitrowwise_to_half_cpu(const Tensor &input)
Tensor fused8bitrowwise_to_float_or_half_cpu(const Tensor &input, const int64_t output_dtype, const bool scale_bias_last, const bool quant_padding_float_type)
Tensor float_to_FP8rowwise_cpu(const Tensor &input, bool forward)
Tensor FP8rowwise_to_float_cpu(const Tensor &input, bool forward, const int64_t output_dtype)
Tensor fusednbitrowwise_to_float_cpu(const Tensor &input, const int64_t bit_rate)
Tensor fusednbitrowwise_to_half_cpu(const Tensor &input, const int64_t bit_rate)
Tensor fusednbitrowwise_to_float_or_half_cpu(const Tensor &input, const int64_t bit_rate, const int64_t output_dtype)
void FloatToFP8Quantized_ref(const float *const input, const size_t nrows, const size_t ncols, uint8_t *const output, const int ebits, const int exponent_bias, const double max_pos)
void FP8QuantizedToFloat_ref(const uint8_t *const input, const size_t nrows, const size_t ncols, float *const output, const int ebits, const int exponent_bias)
