Switching from Multibyte to Unicode

Use this forum if you want to discuss a problem or ask a question related to a hMailServer beta release.
Dravion
Senior user
Posts: 2071
Joined: 2015-09-26 11:50
Location: Germany

Switching from Multibyte to Unicode

Post by Dravion » 2019-08-12 10:48

As a result of a deep dive into hMailServer's C/C++ code, I found a series of functions and data types favoring MBCS (multibyte),
for example the type HM::String, which is defined as an ANSI string.

I found this interesting passage in Jeffrey Richter's book Windows via C/C++, 5th edition, on page 15:
Since Windows NT, all Windows versions are built from the ground up using Unicode. That is, all the core functions for creating windows, displaying text, performing string manipulations, and so forth require Unicode strings. If you call any Windows function passing it an ANSI string (a string of 1-byte characters), the function first converts the string to Unicode and then passes the Unicode string to the operating system. If you are expecting ANSI strings back from a function, the system converts the Unicode string to an ANSI string before returning to your application. All these conversions occur invisibly to you. Of course, there is time and memory overhead involved for the system to carry out all these string conversions.
I think in 2019 we don't need any ANSI and multibyte code anymore and should remove it.
We should just stick to Unicode; as a result we gain some speed on all string operations, and the overall
code base should shrink without losing any features.
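
For illustration, here is a minimal sketch (my own example, not code from hMailServer) of the conversion overhead Richter describes: the A-version of an API converts its arguments and forwards to the W-version, while calling the W-version directly skips that step.

    #include <windows.h>

    void ShowAnsi(const char* text)
    {
        // MessageBoxA converts `text` to UTF-16 internally and then calls MessageBoxW.
        MessageBoxA(nullptr, text, "ANSI call", MB_OK);
    }

    void ShowWide(const wchar_t* text)
    {
        // MessageBoxW hands the UTF-16 string straight to the system, no conversion.
        MessageBoxW(nullptr, text, L"Unicode call", MB_OK);
    }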

martin
Developer
Posts: 6846
Joined: 2003-11-21 01:09
Location: Sweden

Re: Switching from Multibyte to Unicode

Post by martin » 2019-08-13 11:51

Hmm, long time since I looked at this code. There are two string types in hMailServer: HM::String and HM::AnsiString. HM::String relies on wchar_t internally, while HM::AnsiString relies on char.
https://docs.microsoft.com/en-us/cpp/cpp/char-wchar-t-char16-t-char32-t?view=vs-2019 wrote:The wchar_t type is an implementation-defined wide character type. In the Microsoft compiler, it represents a 16-bit wide character used to store Unicode encoded as UTF-16LE, the native character type on Windows operating systems.
So HM::String which you refer to is in fact not ANSI, but UTF-16LE, right?

hMailServer does use HM::AnsiString (char) as well. The idea with this one was to use it in places where a Unicode-encoded string does not make sense. For example, text sent over SMTP/POP3/IMAP won't use a Unicode encoding. Another example is the query functions for MySQL, which take char* and not wchar_t*. If you have an algorithm which wants to scan through a sequence of chars, passing in a wchar_t* may then lead to issues, since each actual character may be represented by more than one byte.
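
Roughly, the split looks like this (a simplified sketch; the names WideString, ByteString and ToUtf8 are made up here, not the actual definitions from the code):

    #include <string>
    #include <windows.h>

    using WideString = std::wstring;  // stand-in for HM::String (wchar_t, UTF-16LE internally)
    using ByteString = std::string;   // stand-in for HM::AnsiString (char)

    // Convert an internal UTF-16 string to UTF-8 bytes before it crosses a
    // protocol or database boundary that works on char*.
    ByteString ToUtf8(const WideString& input)
    {
        if (input.empty()) return {};
        int bytes = WideCharToMultiByte(CP_UTF8, 0, input.c_str(), -1,
                                        nullptr, 0, nullptr, nullptr);
        if (bytes == 0) return {};
        ByteString output(static_cast<size_t>(bytes), '\0');
        WideCharToMultiByte(CP_UTF8, 0, input.c_str(), -1,
                            &output[0], bytes, nullptr, nullptr);
        output.resize(static_cast<size_t>(bytes) - 1); // drop the counted terminating NUL
        return output;
    }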

I've actually forgotten parts of this, but to me it looks like it's currently using UTF-16LE for a majority of cases (String) while still using AnsiString. Do I misunderstand you?
Martin Knafve

Dravion
Senior user
Posts: 2071
Joined: 2015-09-26 11:50
Location: Germany

Re: Switching from Multibyte to Unicode

Post by Dravion » 2019-08-13 14:13

martin wrote:
2019-08-13 11:51
Hmm, long time since I looked at this code. There are two string types in hMailServer: HM::String and HM::AnsiString. HM::String relies on wchar_t internally, while HM::AnsiString relies on char.
https://docs.microsoft.com/en-us/cpp/cpp/char-wchar-t-char16-t-char32-t?view=vs-2019 wrote:The wchar_t type is an implementation-defined wide character type. In the Microsoft compiler, it represents a 16-bit wide character used to store Unicode encoded as UTF-16LE, the native character type on Windows operating systems.
So HM::String which you refer to is in fact not ANSI, but UTF-16LE, right?
As long as we talk about x86 CPUs, LE is fine, since it keeps the least-significant-byte-first (LSB) order.
This would be a different discussion with non-x86 CPUs.
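
A tiny example of what LE means here (my own illustration):

    #include <cstdio>

    int main()
    {
        // With MSVC on x86, wchar_t is a 16-bit UTF-16 code unit. For 'A' (U+0041)
        // the least significant byte comes first in memory: 0x41 0x00.
        wchar_t ch = L'A';
        const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&ch);
        std::printf("%02X %02X\n", static_cast<unsigned>(bytes[0]),
                                   static_cast<unsigned>(bytes[1])); // prints "41 00"
    }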

Visual Studio lets you specify globally, by defining the macro _UNICODE or _MBCS, how the generic text types such as TCHAR are handled.
If your source file, an included header, or your Visual Studio project's preprocessor settings define _MBCS, TCHAR is the classic C char data type.
But if _UNICODE is defined, declarations of the form TCHAR* foo; are automatically treated like wchar_t* foo;.

You can override this behavior at any time by using an ANSI- or wide-character-specific API call in your code, but this makes the code less readable
and harder to understand. That's why I think defaulting to _UNICODE in hMailServer's main VC++ project file would make sense.

You just have to keep in mind that TCHAR* is wchar_t* if you use it in your code.

It will also make the code run faster, because no conversion is necessary: the Windows API can process wchar_t strings 1:1.
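
A small sketch (my own example) of the generic-text mapping that the project setting controls:

    #include <tchar.h>
    #include <windows.h>

    void Greet()
    {
        // Under _UNICODE, TCHAR is wchar_t and _T("Hello") is L"Hello";
        // under _MBCS it is char and a plain "Hello".
        TCHAR greeting[] = _T("Hello");

        // _tcslen maps to wcslen under _UNICODE and to strlen otherwise.
        size_t length = _tcslen(greeting);

        // MessageBox resolves to MessageBoxW when UNICODE is defined
        // (UNICODE and _UNICODE are normally set together), otherwise MessageBoxA.
        MessageBox(nullptr, greeting, _T("Length demo"), MB_OK);
        (void)length;
    }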
hMailServer does use HM::AnsiString (char) as well. The idea with this one was to use it in places where a Unicode-encoded string does not make sense. For example, text sent over SMTP/POP3/IMAP won't use a Unicode encoding. Another example is the query functions for MySQL, which take char* and not wchar_t*. If you have an algorithm which wants to scan through a sequence of chars, passing in a wchar_t* may then lead to issues, since each actual character may be represented by more than one byte.

I've actually forgotten parts of this, but to me it looks like it's currently using UTF-16LE for a majority of cases (String) while still using AnsiString. Do I misunderstand you?
I think this wouldn't stop the show, because it's more an internal, compiler-specific task behind the curtain: how the compiler generates its object code before it hits the linker.

Regarding connection encoding: I think you can't really rely on encoding promises a client makes in the first place.
It could be a security risk if someone is tampering with data whose encoding is never checked on an open socket connection.
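
As a sketch of what I mean by checking first (my own example, not hMailServer code), invalid byte sequences can be rejected instead of converted blindly:

    #include <string>
    #include <windows.h>

    // Reject bytes from the socket that are not valid UTF-8 instead of guessing.
    bool DecodeUtf8Strict(const std::string& raw, std::wstring& out)
    {
        if (raw.empty()) { out.clear(); return true; }
        // MB_ERR_INVALID_CHARS makes the call fail on malformed sequences
        // instead of silently substituting characters.
        int chars = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                        raw.data(), static_cast<int>(raw.size()),
                                        nullptr, 0);
        if (chars == 0) return false; // invalid input: reject it
        out.resize(static_cast<size_t>(chars));
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            raw.data(), static_cast<int>(raw.size()),
                            &out[0], chars);
        return true;
    }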

BeSmart
New user
Posts: 7
Joined: 2019-05-21 10:26

Re: Switching from Multibyte to Unicode

Post by BeSmart » 2019-08-13 22:51

Simply using Unicode for anything and everything is certainly not the smartest thing, and there is a good reason to use ANSI for ANSI-only realms. Developing software is not about taking the easy road, but about getting the job done right. Besides that, Unicode is already enabled for MFC applications by default.

Dravion
Senior user
Posts: 2071
Joined: 2015-09-26 11:50
Location: Germany

Re: Switching from Multibyte to Unicode

Post by Dravion » 2019-08-13 23:28

BeSmart wrote:
2019-08-13 22:51
Simply using Unicode for anything and everything is certainly not the smartest thing, and there is a good reason to use ANSI for ANSI-only realms. Developing software is not about taking the easy road, but about getting the job done right. Besides that, Unicode is already enabled for MFC applications by default.
MS wants us to use Unicode for every project.
It's a wise decision, because Windows NT always was, and always will be, Unicode first.
Regarding string types: you shouldn't explicitly cast anything to multibyte or Unicode before checking whether it's safe on socket connections.
If you do this type of casting, it's a recipe for a security nightmare. Someone can inject all sorts of characters to trick format
string definitions, pass unchecked chars directly onto the stack, trigger a buffer overflow, inject malicious binary shell code as a payload, and gain LOCAL MACHINE super-user permissions (that's an alias for the NT SYSTEM user account, which has the highest
permissions possible on a Windows NT type operating system).

The only thing an attacker needs is to study hMailServer's socket source code, find an ANSI/Unicode format string flaw, prepare a
payload with Metasploit, and boom. That's why you shouldn't intermix multibyte and wide-character byte sequences in the first place.
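
To make the format string point concrete (my own minimal example, not code from hMailServer):

    #include <cstdio>

    void LogClientLine(const char* untrustedClientData)
    {
        // DANGEROUS: the client controls the format string, so "%x", "%s" or "%n"
        // can read or corrupt memory.
        // std::printf(untrustedClientData);

        // SAFE: the format string is a constant; the client data is only an argument.
        std::printf("%s\n", untrustedClientData);
    }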
