Quantization Operators

Quantization is a model-optimization technique that reduces the size of a large model, lowering storage and bandwidth costs at the price of a small loss in accuracy.

CUDA Operators

at::Tensor _float_to_bfloat16_gpu(const at::Tensor &input)

Converts a tensor of float values into a tensor of Brain Floating Point (bfloat16) values.

Parameters:

input – A tensor of float values

Returns:

A new tensor with values from the input tensor converted to bfloat16.

at::Tensor _bfloat16_to_float_gpu(const at::Tensor &input)

Converts a tensor of Brain Floating Point (bfloat16) values into a tensor of float values.

Parameters:

input – A tensor of bfloat16 values

Returns:

A new tensor with values from the input tensor converted to float.
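The two kernels above operate elementwise on bit patterns: bfloat16 keeps float32's sign and 8 exponent bits but only the top 7 mantissa bits. A plain-Python sketch of that bit manipulation (the round-to-nearest-even step is an assumption and may not match the kernel's exact rounding):

```python
import struct

def float_to_bfloat16_bits(x: float) -> int:
    # Reinterpret float32 as uint32; keep the top 16 bits (sign, 8 exp, 7 mantissa).
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Round-to-nearest-even on the 16 bits being dropped.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF

def bfloat16_bits_to_float(b: int) -> float:
    # Widening back is exact: pad the dropped mantissa bits with zeros.
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x
```

Values whose mantissa fits in 7 bits (powers of two, 1.5, etc.) round-trip exactly; everything else lands on the nearest representable bfloat16.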

Tensor _float_to_FP8rowwise_gpu(const Tensor &input, const bool forward)

Converts a tensor of float values into a tensor of fp8 values.

Parameters:
  • input – A tensor of float values. The dtype can be either SparseType::FP32, SparseType::FP16, or SparseType::BF16

  • forward

Throws:

c10::Error – if input.dtype is not one of (SparseType::FP32, SparseType::FP16, or SparseType::BF16).

Returns:

A new tensor with values from the input tensor converted to fp8.

at::Tensor _FP8rowwise_to_float_gpu(const at::Tensor &input, bool forward, const int64_t output_dtype)

Converts a tensor of fp8 values into a tensor of float values.

Parameters:
  • input – A tensor of fp8 values

  • forward

  • output_dtype – The target floating point type, specified as integer representation of SparseType enum

Throws:

c10::Error – if output_dtype is not one of (SparseType::FP32, SparseType::FP16, or SparseType::BF16).

Returns:

A new tensor with values from the input tensor converted to float (with dtype of either SparseType::FP32, SparseType::FP16, or SparseType::BF16).
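Rowwise fp8 quantization picks one scale per row so the row's largest magnitude lands near the top of the fp8 range, then rounds each scaled value to the reduced mantissa. A simplified Python sketch of both steps (the 448.0 maximum assumes an e4m3-style format, and exponent clamping is ignored; neither is taken from this page):

```python
import math

def fp8_rowwise_scale(row, fp8_max=448.0):
    # One scale per row: map the largest magnitude to the fp8 maximum.
    amax = max(abs(x) for x in row) or 1.0
    return fp8_max / amax

def fake_quant_fp8(x, mbits=3):
    # Round x to a float with an mbits-bit mantissa ("fake quantization").
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)           # x = m * 2**e, with 0.5 <= |m| < 1
    step = 2.0 ** -(mbits + 1)     # mantissa resolution at this exponent
    return round(m / step) * step * 2.0 ** e
```

The relative error of `fake_quant_fp8` is bounded by half a mantissa step, which is the accuracy/size trade-off fp8 formats make.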

Tensor _float_to_fused8bitrowwise_gpu(const Tensor &input)

Converts a tensor of float values into a tensor of fused 8-bit rowwise values.

Parameters:

input – A tensor of float values

Returns:

A new tensor with values from the input tensor converted to fused 8-bit rowwise.

Tensor _half_to_fused8bitrowwise_gpu(const Tensor &input)

Converts a tensor of at::Half values into a tensor of fused 8-bit rowwise values.

Parameters:

input – A tensor of at::Half values

Returns:

A new tensor with values from the input tensor converted to fused 8-bit rowwise.

Tensor _single_or_half_precision_to_fused8bitrowwise_gpu(const Tensor &input)

Converts a tensor of float (single-precision) or at::Half values into a tensor of fused 8-bit rowwise values.

Parameters:

input – A tensor of float (single-precision) or at::Half values

Returns:

A new tensor with values from the input tensor converted to fused 8-bit rowwise.

at::Tensor _fused8bitrowwise_to_float_gpu(const at::Tensor &input)

Converts a tensor of fused 8-bit rowwise values into a tensor of float values.

Parameters:

input – A tensor of fused 8-bit rowwise values

Returns:

A new tensor with values from the input tensor converted to float.
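"Fused" in these operators means each quantized row carries its own scale and zero point appended to the row's bytes, so a row can be dequantized independently. A minimal Python reference of the per-row asymmetric scheme (the kernel's exact byte layout and how it stores the scale/bias are not reproduced here):

```python
def quantize_row_8bit(row):
    # Per-row asymmetric quantization: map [min, max] onto integer codes [0, 255].
    lo, hi = min(row), max(row)
    scale = (hi - lo) / 255.0 or 1.0   # avoid division by zero on constant rows
    q = [round((x - lo) / scale) for x in row]
    return q, scale, lo                # the fused layout appends scale/bias per row

def dequantize_row_8bit(q, scale, bias):
    return [v * scale + bias for v in q]
```

Round-tripping a row reconstructs each value to within half a quantization step (`scale / 2`).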

at::Tensor _fused8bitrowwise_to_half_gpu(const at::Tensor &input)

Converts a tensor of fused 8-bit rowwise values into a tensor of at::Half values.

Parameters:

input – A tensor of fused 8-bit rowwise values

Returns:

A new tensor with values from the input tensor converted to at::Half.

at::Tensor _fused8bitrowwise_to_single_or_half_precision_gpu(const at::Tensor &input, const int64_t output_dtype, const bool scale_bias_last, const bool quant_padding_float_type)

Converts a tensor of fused 8-bit rowwise values into a tensor of float, at::Half, or at::BFloat16 values.

Parameters:
  • input – A tensor of fused 8-bit rowwise values

  • output_dtype – The target floating point type, specified as integer representation of SparseType enum

  • scale_bias_last

  • quant_padding_float_type

Throws:

c10::Error – if output_dtype is not one of (SparseType::FP32, SparseType::FP16, or SparseType::BF16).

Returns:

A new tensor with values from the input tensor converted to float, at::Half, or at::BFloat16.

at::Tensor _fused8bitrowwise_to_float_mixed_dim_gpu(const at::Tensor &input, const at::Tensor &D_offsets, const int64_t output_dtype)

Converts a tensor of fused 8-bit rowwise values into a tensor of at::kFloat or at::kHalf values.

Parameters:
  • input – A tensor of fused 8-bit rowwise values

  • D_offsets

  • output_dtype – The target floating point type, specified as integer representation of SparseType enum

Throws:

c10::Error – if output_dtype is not one of (SparseType::FP32 or SparseType::FP16).

Returns:

A new tensor with values from the input tensor converted to at::kFloat or at::kHalf.

Tensor _float_to_fusednbitrowwise_gpu(const Tensor &input, const int64_t bit_rate)

Converts a tensor of float values into a tensor of fused N-bit rowwise values.

Parameters:
  • input – A tensor of float values

  • bit_rate

Returns:

A new tensor with values from the input tensor converted to fused N-bit rowwise.

at::Tensor _half_to_fusednbitrowwise_gpu(const at::Tensor &input, const int64_t bit_rate)

Converts a tensor of at::Half values into a tensor of fused N-bit rowwise values.

Parameters:
  • input – A tensor of at::Half values

  • bit_rate

Returns:

A new tensor with values from the input tensor converted to fused N-bit rowwise.

Tensor _single_or_half_precision_to_fusednbitrowwise_gpu(const Tensor &input, const int64_t bit_rate)

Converts a tensor of float or at::Half values into a tensor of fused N-bit rowwise values.

Parameters:
  • input – A tensor of float or at::Half values

  • bit_rate

Returns:

A new tensor with values from the input tensor converted to fused N-bit rowwise.
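For bit_rate = 4, two quantized codes share each output byte. A hypothetical packing sketch of that idea (the nibble order is an assumption, and the per-row scale/bias bytes that the fused layout also stores are omitted):

```python
def pack_4bit(codes):
    # Pack pairs of 4-bit codes (0..15) into bytes, low nibble first.
    out = []
    for i in range(0, len(codes), 2):
        lo = codes[i] & 0xF
        hi = (codes[i + 1] & 0xF) if i + 1 < len(codes) else 0  # zero-pad odd tails
        out.append(lo | (hi << 4))
    return bytes(out)

def unpack_4bit(data, n):
    # Recover n codes; the padding nibble on odd lengths is discarded.
    vals = []
    for b in data:
        vals.append(b & 0xF)
        vals.append(b >> 4)
    return vals[:n]
```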

at::Tensor _fusednbitrowwise_to_float_gpu(const at::Tensor &input, const int64_t bit_rate)

Converts a tensor of fused N-bit rowwise values into a tensor of float values.

Parameters:
  • input – A tensor of fused N-bit rowwise values

  • bit_rate

Returns:

A new tensor with values from the input tensor converted to float.

at::Tensor _fusednbitrowwise_to_half_gpu(const at::Tensor &input, const int64_t bit_rate)

Converts a tensor of fused N-bit rowwise values into a tensor of at::Half values.

Parameters:
  • input – A tensor of fused N-bit rowwise values

  • bit_rate

Returns:

A new tensor with values from the input tensor converted to at::Half.

at::Tensor _fusednbitrowwise_to_single_or_half_precision_gpu(const at::Tensor &input, const int64_t bit_rate, const int64_t output_dtype)

Converts a tensor of fused N-bit rowwise values into a tensor of float, at::Half, or at::BFloat16 values.

Parameters:
  • input – A tensor of fused N-bit rowwise values

  • bit_rate

  • output_dtype – The target floating point type, specified as integer representation of SparseType enum

Throws:

c10::Error – if output_dtype is not one of (SparseType::FP32, SparseType::FP16, or SparseType::BF16).

Returns:

A new tensor with values from the input tensor converted to float, at::Half, or at::BFloat16, depending on output_dtype.

at::Tensor _float_to_hfp8_gpu(const at::Tensor &input, const int64_t ebits, const int64_t exponent_bias, const double max_pos)

Converts a tensor of float values into a tensor of Hybrid 8-bit Floating Point (hfp8) values.

Parameters:
  • input – A tensor of float values

  • ebits

  • exponent_bias

  • max_pos

Throws:

c10::Error – if ebits <= 0 or exponent_bias <= 0.

Returns:

A new tensor with values from the input tensor converted to hfp8.

at::Tensor _hfp8_to_float_gpu(const at::Tensor &input, const int64_t ebits, const int64_t exponent_bias)

Converts a tensor of Hybrid 8-bit Floating Point (hfp8) values into a tensor of float values.

Parameters:
  • input – A tensor of hfp8 values

  • ebits

  • exponent_bias

Throws:

c10::Error – if ebits <= 0 or exponent_bias <= 0.

Returns:

A new tensor with values from the input tensor converted to float.
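An hfp8 byte packs a sign bit, ebits exponent bits, and (7 − ebits) mantissa bits, with a configurable exponent_bias. A simplified Python decoder illustrating that layout (subnormal and saturation details are glossed over and may not match the kernel exactly):

```python
def hfp8_to_float(byte, ebits, exponent_bias):
    # Layout (MSB to LSB): 1 sign bit | ebits exponent bits | (7 - ebits) mantissa bits.
    mbits = 7 - ebits
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> mbits) & ((1 << ebits) - 1)
    man = byte & ((1 << mbits) - 1)
    if exp == 0:
        # Subnormal: no implicit leading 1.
        return sign * (man / (1 << mbits)) * 2.0 ** (1 - exponent_bias)
    return sign * (1.0 + man / (1 << mbits)) * 2.0 ** (exp - exponent_bias)
```

With ebits = 4 and exponent_bias = 7, the byte `0b0_0111_000` decodes to 1.0 (exponent field equals the bias, mantissa zero).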

at::Tensor _float_to_msfp_gpu(const at::Tensor &input, const int64_t bounding_box_size, const int64_t ebits, const int64_t mbits, const int64_t bias, const double min_pos, const double max_pos)

Converts a tensor of float values into a tensor of Microsoft Floating Point (msfp) values.

Parameters:
  • input – A tensor of float values

  • bounding_box_size

  • ebits

  • mbits

  • bias

  • min_pos

  • max_pos

Returns:

A new tensor with values from the input tensor converted to msfp.

at::Tensor _msfp_to_float_gpu(const at::Tensor &input, const int64_t ebits, const int64_t mbits, const int64_t bias)

Converts a tensor of Microsoft Floating Point (msfp) values into a tensor of float values.

Parameters:
  • input – A tensor of msfp values

  • ebits

  • mbits

  • bias

Returns:

A new tensor with values from the input tensor converted to float.
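msfp quantizes a "bounding box" of bounding_box_size values under a single shared exponent taken from the block's largest magnitude, so only a short mantissa is stored per value. A sketch of that idea (mbits = 4 and round-to-nearest are assumptions, not the kernel's documented behavior):

```python
import math

def msfp_shared_exponent(block):
    # Shared exponent comes from the largest magnitude in the bounding box.
    amax = max(abs(x) for x in block)
    return math.frexp(amax)[1] if amax else 0

def msfp_quantize_block(block, mbits=4):
    e = msfp_shared_exponent(block)
    step = 2.0 ** (e - mbits)          # mantissa resolution under the shared exponent
    return [round(x / step) * step for x in block]
```

Values near the block maximum keep good relative precision; much smaller values in the same block lose precision, which is the msfp trade-off.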

Tensor _float_to_paddedFP8rowwise_gpu(const Tensor &input, const bool forward, const int64_t row_dim)

Converts a tensor of float values into a tensor of padded fp8 rowwise values.

Parameters:
  • input – A tensor of float values. The dtype can be either SparseType::FP32, SparseType::FP16, or SparseType::BF16

  • forward

  • row_dim

Returns:

A new tensor with values from the input tensor converted to padded fp8 rowwise.

at::Tensor _paddedFP8rowwise_to_float_gpu(const at::Tensor &input, const bool forward, const int64_t row_dim, const int64_t output_last_dim, const int64_t output_dtype)

Converts a tensor of padded fp8 rowwise values into a tensor of float values.

Parameters:
  • input – A tensor of padded fp8 rowwise values

  • forward

  • row_dim

  • output_last_dim

  • output_dtype – The target floating point type, specified as integer representation of SparseType enum

Throws:

c10::Error – if output_dtype is not one of (SparseType::FP32, SparseType::FP16, SparseType::BF16).

Returns:

A new tensor with values from the input tensor converted to float.

CPU Operators

Tensor &_fused8bitrowwise_to_float_cpu_out(Tensor &output, const Tensor &input)
Tensor &_float_to_fused8bitrowwise_cpu_out(Tensor &output, const Tensor &input)
Tensor float_to_fused8bitrowwise_cpu(const Tensor &input)
Tensor half_to_fused8bitrowwise_cpu(const Tensor &input)
Tensor float_or_half_to_fused8bitrowwise_cpu(const Tensor &input)
Tensor fused8bitrowwise_to_float_cpu(const Tensor &input)
Tensor fused8bitrowwise_to_half_cpu(const Tensor &input)
Tensor fused8bitrowwise_to_float_or_half_cpu(const Tensor &input, const int64_t output_dtype, const bool scale_bias_last, const bool quant_padding_float_type)
Tensor float_to_FP8rowwise_cpu(const Tensor &input, bool forward)
Tensor FP8rowwise_to_float_cpu(const Tensor &input, bool forward, const int64_t output_dtype)
Tensor fusednbitrowwise_to_float_cpu(const Tensor &input, const int64_t bit_rate)
Tensor fusednbitrowwise_to_half_cpu(const Tensor &input, const int64_t bit_rate)
Tensor fusednbitrowwise_to_float_or_half_cpu(const Tensor &input, const int64_t bit_rate, const int64_t output_dtype)
void FloatToFP8Quantized_ref(const float *const input, const size_t nrows, const size_t ncols, uint8_t *const output, const int ebits, const int exponent_bias, const double max_pos)
void FP8QuantizedToFloat_ref(const uint8_t *const input, const size_t nrows, const size_t ncols, float *const output, const int ebits, const int exponent_bias)
