BOM handling
The byte order mark (BOM) is the code point U+FEFF placed at the start of a text stream to signal its Unicode encoding form and, for UTF-16 and UTF-32, its byte order. In practice, applications often see it in CSV exports, XML files, UTF-16 text files, and partner feeds. The library follows iconv-lite behavior for stripping and prepending BOMs.
Decoding with the default policy
BOM-aware decoders strip an initial decoded U+FEFF by default:
auto text = polycpp::iconv_lite::decode(bytes, "utf8");
Only an initial BOM is affected. U+FEFF elsewhere in the decoded text is normal content.
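The "initial only" rule can be illustrated without the library. The following is a minimal standalone sketch, not the library's implementation: `strip_initial_bom` is a hypothetical helper that works on already-decoded code points and removes U+FEFF only when it is the first one.

```cpp
#include <string>

// Hypothetical helper for illustration only; it is not part of
// polycpp::iconv_lite. It mirrors the documented rule: U+FEFF is
// dropped only when it is the very first decoded code point.
std::u32string strip_initial_bom(std::u32string text) {
    if (!text.empty() && text.front() == U'\uFEFF') {
        text.erase(text.begin());  // remove exactly one leading U+FEFF
    }
    return text;  // any later U+FEFF is left in place as content
}
```

Under this sketch, a string that starts with two U+FEFF code points loses only the first, and a U+FEFF in the middle of the text is never touched.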
Keeping a BOM as content
Disable stripping when the BOM is meaningful data:
polycpp::iconv_lite::DecodeOptions options;
options.stripBOM = false;
auto text = polycpp::iconv_lite::decode(bytes, "utf8", options);
When stripBOM is false, onBOMStripped is not called because no BOM is removed.
Observing BOM removal
Observe actual BOM removal with onBOMStripped:
bool removed = false;
polycpp::iconv_lite::DecodeOptions options;
options.onBOMStripped = [&] { removed = true; };
auto text = polycpp::iconv_lite::decode(bytes, "utf8", options);
The callback is useful for telemetry and compatibility checks: it tells you that the payload included an initial BOM without requiring the application to keep U+FEFF in the returned string.
Encoding with a BOM
For encoding, utf16 and utf32 auto encoders add a BOM by default.
Other BOM-aware encodings add one only when requested:
polycpp::iconv_lite::EncodeOptions options;
options.addBOM = true;
auto bytes = polycpp::iconv_lite::encode("hello", "utf8", options);
Set addBOM=false to suppress the default BOM on utf16 or utf32.
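For reference when inspecting encoded output, the BOM byte sequences defined by the Unicode standard can be written down directly. The sketch below is standalone and does not call the library; `BomKind`, `bom_bytes`, and `with_bom` are hypothetical names introduced for illustration. It makes no claim about which byte order the library's auto encoders choose.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical names for illustration; not part of polycpp::iconv_lite.
enum class BomKind { Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Returns U+FEFF serialized in the given encoding form and byte order.
std::vector<std::uint8_t> bom_bytes(BomKind kind) {
    switch (kind) {
        case BomKind::Utf8:    return {0xEF, 0xBB, 0xBF};
        case BomKind::Utf16LE: return {0xFF, 0xFE};
        case BomKind::Utf16BE: return {0xFE, 0xFF};
        case BomKind::Utf32LE: return {0xFF, 0xFE, 0x00, 0x00};
        case BomKind::Utf32BE: return {0x00, 0x00, 0xFE, 0xFF};
    }
    return {};  // unreachable for valid enum values
}

// Prepends the matching BOM to an already-encoded payload.
std::vector<std::uint8_t> with_bom(BomKind kind,
                                   const std::vector<std::uint8_t>& payload) {
    std::vector<std::uint8_t> out = bom_bytes(kind);
    out.insert(out.end(), payload.begin(), payload.end());
    return out;
}
```

Comparing the first bytes of an encoded buffer against these sequences is a quick way to confirm whether a BOM was actually written.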
Choosing a policy
| Situation | Suggested policy |
|---|---|
| Reading modern UTF-8 text | Keep the default |
| Preserving exact text content for a diff or editor | Use stripBOM=false |
| Exporting for a system that requires UTF-8 BOM | Set addBOM=true |
| Exporting UTF-16 or UTF-32 with auto endianness | Use the default BOM unless the target explicitly forbids it. |