[fpc-pascal] Concatenating CP Strings
Martok
2018-09-14 23:38:33 UTC
Hi all,

concatenating codepage strings is documented to be a bit weird:
<http://wiki.freepascal.org/FPC_Unicode_support#String_concatenations>

Knowing this, how does one achieve the following?

- have a string in any dynamic codepage
- append another string (possibly from different CP), or a literal
- have the result in the same dynamic codepage as before

Literally, "transcode the new part and plop it at the end"?

Using AnsiStrings does not work, as the declared CP is CP_ACP, which is not the
dynamic CP, and loss of data is likely. Using RawByteStrings does not work, as
they get converted to CP_ACP regardless of their current dynamic CP, and loss of
data is likely. Insert() does not work, because it doesn't care about characters
at all and just moves bytes.

Doing the entire thing manually with a temp string does work, but such a simple
task can't be that difficult, can it?


Thank you,

Martok



PS:
Also, somewhat related: how compatible are the different widestring managers
supposed to be? Windows doesn't support CP_UTF16(BE) (which really is UCS2 - aka
the MBCS alias of WideString), but fpwidestring has correct handling for it.

_______________________________________________
fpc-pascal maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo
Mattias Gaertner via fpc-pascal
2018-09-15 05:34:37 UTC
On Sat, 15 Sep 2018 01:38:33 +0200
Post by Martok
Hi all,
<http://wiki.freepascal.org/FPC_Unicode_support#String_concatenations>
Knowing this, how does one achieve the following?
- have a string in any dynamic codepage
- append another string (possibly from different CP), or a literal
- have the result in the same dynamic codepage as before
Literally, "transcode the new part and plop it at the end"?
Using AnsiStrings does not work, as the declared CP is CP_ACP, which is not the
dynamic CP, and loss of data is likely.
There is a neat RTL function, SetCodePage, to set the CP of a string.
If your string is labelled with the wrong CP, relabel it (no conversion)
before concatenating:
SetCodePage(s,RealCP,false);

To have the result converted to a specific codepage use
SetCodePage(result,NeededCP,true);
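
The difference between the two calls can be sketched as follows. This is only an illustration, not code from the thread: the program name and the sample bytes are made up, and it assumes that a widestring manager able to convert to cp1252 is installed (on Unix that usually means pulling in cwstring).

```pascal
program SetCodePageSketch;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring;  // assumption: supplies the code page conversions on Unix
{$endif}

var
  s: RawByteString;
begin
  // Two raw bytes that happen to encode 'é' in UTF-8. A string built
  // this way does not yet know that its payload is UTF-8.
  SetLength(s, 2);
  s[1] := #$C3;
  s[2] := #$A9;

  // Convert = False: only the label changes, the bytes stay as they are.
  SetCodePage(s, CP_UTF8, False);
  WriteLn(StringCodePage(s), ' ', Length(s));

  // Convert = True: the payload is transcoded to the requested code page.
  SetCodePage(s, 1252, True);
  WriteLn(StringCodePage(s), ' ', Length(s));
end.
```

With a working conversion, the second line of output should report a shorter string than the first, since the character needs only one byte in cp1252.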
Post by Martok
[...]
Also, somewhat related: how compatible are the different widestring managers
supposed to be? Windows doesn't support CP_UTF16(BE) (which really is UCS2 - aka
the MBCS alias of WideString),
Only on ancient Windows it was UCS2. Even on old Win7 it was UTF16. I
don't remember when the transition happened. Sadly, there are still
many Windows applications only supporting UCS2.
Post by Martok
but fpwidestring has correct handling for it.
Mattias
Martok
2018-09-15 11:55:37 UTC
Post by Mattias Gaertner via fpc-pascal
To have the result in a specific codepage use
SetCodePage(result,NeededCP,true);
As I wrote, doing this by hand works, but I don't want to believe somebody
thought that was how it should be. Why would "CP_UTF16 + CP_UTF8(literal) =
cp1252" be the desired outcome?

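operator >< (a, b: RawByteString): RawByteString;
var
  t: RawByteString;
  la, lt: Integer;
begin
  // Transcode the right-hand side into the dynamic code page of the left.
  t := b;
  SetCodePage(t, StringCodePage(a), True);
  la := Length(a);
  lt := Length(t);
  // Start from a, then grow the buffer and append the transcoded bytes.
  Result := a;
  SetLength(Result, la + lt);
  if lt > 0 then  // t[1] must not be touched for an empty string
    Move(t[1], Result[la + 1], lt);
end;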

With that, one can write "foo:= bar >< x" and it just works.
Post by Mattias Gaertner via fpc-pascal
Only on ancient Windows it was UCS2.
In that case, fpwidestring is wrong as well, see fpwidestring.pp:262.

MSDN is slightly unclear:
"1200 utf-16
Unicode UTF-16, little endian byte order (BMP of ISO 10646);
available only to managed applications"

The "managed applications" part is why WideCharToMultiByte simply returns an
empty string when asked to convert anything to cp1200, instead of just doing the
plain memcpy.

"only the BMP" would be UCS2. In other places, surrogate pairs are mentioned,
making it a true UTF encoding.
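
A small sketch of why the distinction matters, assuming nothing beyond the System unit (the program name is made up): a code point outside the BMP occupies two 16-bit units in UTF-16, which a pure UCS2 reading would count as two separate characters.

```pascal
program SurrogateSketch;
{$mode objfpc}{$H+}

var
  u: UnicodeString;
begin
  // U+1F600 lies outside the BMP; UTF-16 stores it as the
  // surrogate pair D83D DE00.
  SetLength(u, 2);
  u[1] := WideChar($D83D);
  u[2] := WideChar($DE00);

  // Length counts 16-bit code units, not code points.
  WriteLn('UTF-16 code units: ', Length(u));

  // A UTF-16 aware conversion treats the pair as a single code point.
  WriteLn('UTF-8 bytes: ', Length(UTF8Encode(u)));
end.
```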

In any case, I think the RTL should be consistent across platforms?

--
Regards,
Martok



Martok
2018-09-15 20:12:52 UTC
And another one:

var
  f: TextFile;
  s: string;
begin
  AssignFile(f, 'a_file.txt');
  SetTextCodePage(f, 866);
  Reset(f);
  ReadLn(f, s);
  WriteLn(StringCodePage(s));
  readln;
end.

That is rather useless...


Writing anything into the specified codepage works perfectly fine.
--
Regards,
Martok

Martok
2018-09-15 20:32:17 UTC
Gah, accidentally removed the comment that said what the actual problem is ;-)
Post by Martok
ReadLn(f, s);
WriteLn(StringCodePage(s));
That prints 1252, which is the DefaultSystemCodePage. At that point, information
loss has already occurred, and there is no way to fix the CP in user code.
I would expect reading from a file whose codepage I have just set to return
strings in that codepage. Instead, I get the declared codepage of the string.
--
Regards,
Martok


Jonas Maebe
2018-09-15 20:35:18 UTC
Post by Martok
Gah, accidentally removed the comment that said what the actual problem is ;-)
Post by Martok
ReadLn(f, s);
WriteLn(StringCodePage(s));
That prints 1252, which is the DefaultSystemCodePage. At that point, information
loss has already occurred, and there is no way to fix the CP in user code.
I would expect reading from a file whose codepage I have just set to return
strings in that codepage. Instead, I get the declared codepage of the string.
Setting the code page of a file tells the RTL about the encoding of the
strings in the file. The string's static code page (which maps to
DefaultSystemCodePage if none is specified) tells the compiler to which
encoding this string data should be converted.


Jonas
Martok
2018-09-15 22:03:33 UTC
Post by Jonas Maebe
Setting the code page of a file tells the RTL about the encoding of the
strings in the file. The string's static code page (which maps to
DefaultSystemCodePage if none is specified) tells the compiler to which
encoding this string data should be converted.
I know!

That doesn't make it any more *useful*.
--
Regards,
Martok

Sven Barth via fpc-pascal
2018-09-16 10:35:36 UTC
Post by Martok
var
f: TextFile;
s: string;
begin
AssignFile(f, 'a_file.txt');
SetTextCodePage(f, 866);
Reset(f);
ReadLn(f, s);
WriteLn(StringCodePage(s));
readln;
end.
That is rather useless...
No, it is not. The code page of "String" is CP_ACP, which means that
every string assigned to it gets converted to the code page that the RTL
determined at startup to be the system code page (or whatever was set
using SetMultiByte*CodePage() before the assignment).

If you want the content to *be* in code page 866 without any tricks then
you need to declare an AnsiString with that code page and use that:

=== code begin ===

type
  TCP866String = type AnsiString(866);

var
  f: TextFile;
  s: TCP866String;
begin
  Assign(f, 'a_file.txt');
  SetTextCodePage(f, 866);
  Reset(f);
  ReadLn(f, s);
  WriteLn(StringCodePage(s));
  ReadLn;
end.

=== code end ===

This will print "866" and no conversions of the file will have been
done. Alternatively you can use UTF8String (which is a "type
AnsiString(CP_UTF8)") and it will print 65001 with a lossless conversion
to UTF-8 having occurred.

TL;DR: "AnsiString"/"String" is a type that has the code page that was
determined at startup, not one that turns itself into whatever code page
gets thrown at it

(Note: there are situations where the static code page of the string and
the dynamic one don't match, but they are rare exceptions to the rule)

Regards,
Sven
Martok
2018-09-16 11:31:38 UTC
Post by Sven Barth via fpc-pascal
If you want the content to *be* in code page 866 without any tricks then
=== code begin ===
type
TCP866String = type AnsiString(866);
That only works if the codepage is known at compile time.

Let's say the user directs a program to "treat this file as $codepage".
Therefore, I need to read it as this codepage and fill internal data structures
with strings in that codepage, while keeping other operations in the system
codepage (so I can't just change DefaultSystemCodepage). Does that mean that
there is no way to do this with native strings?
Post by Sven Barth via fpc-pascal
TL;DR: "AnsiString"/"String" is a type that has the code page that was
determined at startup, not one that turns itself into whatever code page
gets thrown at it
Actually, there is a String type that is just that (at least according to the
wiki): RawByteString. Supposedly, it just accepts any dynamic codepage without
conversion. But it doesn't work for either of the cases here?

--
Regards,
Martok

Jonas Maebe
2018-09-16 12:31:06 UTC
Post by Martok
Let's say the user directs a program to "treat this file as $codepage".
Therefore, I need to read it as this codepage and fill internal data structures
with strings in that codepage, while keeping other operations in the system
codepage (so I can't just change DefaultSystemCodepage). Does that mean that
there is no way to do this with native strings?
I can only second guess the ideas behind Embarcadero's introduction of
the codepage-aware ansistrings, but I think the main purpose was to make
it easier to convert existing code that was written for a particular
system codepage into a program that works with unicodestring.

Hence, the codepage-aware string functionality supports setting the
wanted code page at the input and output level, and everything in
between is expected to be performed using either unicodestring or
DefaultSystemCodePage. FPC slightly extended this so that the encoding
of system file names (and the code page returned by related routines)
can be specified differently, so that you can easily set
DefaultSystemCodePage to CP_UTF8 (or something else) regardless of what
codepage the system's APIs used by the RTL expect.

In general, a program will seldom have built-in support for analysing
and manipulating strings in every possible codepage in existence. The
general paradigm is to convert a string to a single encoding that is
used internally, perform analysis/processing using this encoding, and
then convert it back. In fact, that is what many runtime library
routines also do because most OS library functions only support a very
select number of codepages (or the OS library functions do it themselves
internally).
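
A minimal sketch of that paradigm, with UTF-16 (UnicodeString) as the internal encoding. The helper name TabsToSpaces and the sample data are made up for illustration. The sketch assumes that converting a RawByteString to UnicodeString decodes according to the string's dynamic code page, and that a widestring manager supporting the code pages involved is installed.

```pascal
program InternalEncodingSketch;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring;  // assumption: supplies the code page conversions on Unix
{$endif}

// Hypothetical helper: replaces tabs by spaces and hands the text back
// in the dynamic code page it arrived in.
function TabsToSpaces(const aText: RawByteString): RawByteString;
var
  cp: TSystemCodePage;
  u: UnicodeString;
  i: SizeInt;
begin
  cp := StringCodePage(aText);
  // 1. Decode into the single internal encoding.
  u := UnicodeString(aText);
  // 2. Process without caring about the original code page.
  for i := 1 to Length(u) do
    if u[i] = #9 then
      u[i] := ' ';
  // 3. Encode back into the code page the caller handed in.
  Result := UTF8Encode(u);
  SetCodePage(Result, cp, True);
end;

var
  s, r: RawByteString;
begin
  // 'a', TAB, then the UTF-8 bytes of 'é', labelled as UTF-8.
  SetLength(s, 4);
  s[1] := 'a';
  s[2] := #9;
  s[3] := #$C3;
  s[4] := #$A9;
  SetCodePage(s, CP_UTF8, False);

  r := TabsToSpaces(s);
  WriteLn(StringCodePage(r), ' ', Length(r));
end.
```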

If you don't care about the codepage and won't perform any processing
that depends on the code page, then codepage-aware strings are probably
the wrong data structure. Arrays may be more appropriate.

Alternatively, you can set the codepage of your text file to
DefaultSystemCodePage and read a regular ansistring from it. You can
still force the code page of the string you read afterwards to something
else using SetCodePage() (with Convert set to False) if you wish to use
the equivalent of an explicit typecast at the string codepage level. But
indeed, this is not a workflow that the codepage-aware strings support
without extra work on your part, and as explained above, I don't think
it was ever intended to be either.
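
The workaround described in the previous paragraph could look roughly like this. It is a sketch under assumptions, not a tested recipe: the function name ReadFirstLineAs and the command-line handling are invented for the example, and it relies on ReadLn performing no conversion when the file's code page and the string's declared code page agree.

```pascal
program ReadWithRuntimeCP;
{$mode objfpc}{$H+}

// Hypothetical helper: reads the first line of a text file unconverted
// and labels it with a code page that is only known at run time.
function ReadFirstLineAs(const aFileName: string;
  aCP: TSystemCodePage): RawByteString;
var
  f: TextFile;
  line: AnsiString;
begin
  AssignFile(f, aFileName);
  // Same code page on both sides, so ReadLn has nothing to convert.
  SetTextCodePage(f, DefaultSystemCodePage);
  Reset(f);
  try
    ReadLn(f, line);
  finally
    CloseFile(f);
  end;
  // Convert = False: keep the bytes, only change the label.
  SetCodePage(RawByteString(line), aCP, False);
  Result := line;
end;

var
  cp: TSystemCodePage;
  code: Integer;
  s: RawByteString;
begin
  if ParamCount < 2 then
  begin
    WriteLn('usage: ReadWithRuntimeCP <file> <codepage>');
    Halt(1);
  end;
  Val(ParamStr(2), cp, code);
  if code <> 0 then
    Halt(1);
  s := ReadFirstLineAs(ParamStr(1), cp);
  WriteLn(StringCodePage(s));
end.
```

The result is held in a RawByteString on purpose: the dynamic code page no longer matches CP_ACP, so passing it on through plain AnsiString variables would bring back the situation Sven's note above calls an exception from the rule.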
Post by Martok
Post by Sven Barth via fpc-pascal
TL;DR: "AnsiString"/"String" is a type that has the code page that was
determined at startup, not one that turns itself into whatever code page
gets thrown at it
Actually, there is a String type that is just that (at least according to the
wiki): RawByteString. Supposedly, it just accepts any dynamic codepage without
conversion. But it doesn't work for either of the cases here?
RawByteString is something that is largely undocumented by Embarcadero.
I tried my best to make the behaviour as compatible as possible with
Delphi, but there are still bugs in it (and holes in my knowledge about
how exactly they are supposed to behave in all possible situations).


Jonas
Martok
2018-09-16 13:48:49 UTC
[...snip...]
Thank you very much for this explanation! One for the bookmarks.

It just seems very odd to me to have the incredibly powerful and useful dynamic
codepage concept... and then trash it on every assignment.
But if that was an Emba-invention, that explains a few things...


Concrete example: the MS RC script format allows changing the input codepage at
runtime (#pragma code_page), meaning the next #include must be interpreted in
that CP, and output generated from this input must be in that CP (unless it is
written in widestring format, but that is decided very late in the process).
This would be super easy to do if one could just pass around a string with the
correct dynamic code page (one could even use the CP_UTF16 codepage for L"foo"
widestrings). But as fcl-res uses AnsiStrings everywhere, this cannot work, as
the only lossless setting would be to use DefaultSystemCodePage=UTF8 for the
entire program, completely ignoring the user, which might cause MUI problems.

Windres gets this wrong as well, but somehow that doesn't really make me feel
any better ;-)

--
Regards,
Martok


Sven Barth via fpc-pascal
2018-09-16 10:49:48 UTC
Post by Martok
Hi all,
<http://wiki.freepascal.org/FPC_Unicode_support#String_concatenations>
Knowing this, how does one achieve the following?
- have a string in any dynamic codepage
- append another string (possibly from different CP), or a literal
- have the result in the same dynamic codepage as before
Literally, "transcode the new part and plop it at the end"?
Using AnsiStrings does not work, as the declared CP is CP_ACP, which is not the
dynamic CP, and loss of data is likely. Using RawByteStrings does not work, as
they get converted to CP_ACP regardless of their current dynamic CP, and loss of
data is likely. Insert() does not work, because it doesn't care about characters
at all and just moves bytes.
Doing the entire thing manually with a temp string does work, but such a simple
task can't be that difficult, can it?
Safest and cleanest would probably be something like this:

=== code begin ===

function ConcatWithCP(const aLeft, aRight: RawByteString; aCP: LongInt): RawByteString; inline;
begin
  Result := aLeft;
  SetCodePage(Result, CP_UTF8, True);
  Result := Result + aRight;
  SetCodePage(Result, aCP, True);
end;

=== code end ===

(alternatively with an array of RawByteString to concatenate multiple
strings at once)
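
One possible shape for the array variant is sketched below. Note that it is not the function above generalised: instead of going through UTF-8 and the + operator, it transcodes each part to the target code page and joins the bytes, in the spirit of the >< operator posted earlier in the thread. The name ConcatInCP is made up, and a widestring manager supporting the target code page is assumed.

```pascal
program ConcatPartsSketch;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring;  // assumption: supplies the code page conversions on Unix
{$endif}

// Hypothetical helper: transcodes every part to aCP and joins the bytes.
function ConcatInCP(const aParts: array of RawByteString;
  aCP: TSystemCodePage): RawByteString;
var
  converted: array of RawByteString;
  i: Integer;
  total, offset, n: SizeInt;
begin
  SetLength(converted, Length(aParts));
  total := 0;
  for i := 0 to High(aParts) do
  begin
    // SetCodePage changes only this copy, not the caller's string.
    converted[i] := aParts[i];
    if Length(converted[i]) > 0 then
      SetCodePage(converted[i], aCP, True);
    Inc(total, Length(converted[i]));
  end;

  Result := '';
  SetLength(Result, total);
  offset := 0;
  for i := 0 to High(converted) do
  begin
    n := Length(converted[i]);
    if n > 0 then
    begin
      Move(converted[i][1], Result[offset + 1], n);
      Inc(offset, n);
    end;
  end;
  // The buffer was filled byte by byte, so label it explicitly.
  if total > 0 then
    SetCodePage(Result, aCP, False);
end;

var
  a, b, r: RawByteString;
begin
  a := 'abc';
  SetCodePage(a, 1252, True);
  b := 'def';
  SetCodePage(b, CP_UTF8, True);
  r := ConcatInCP([a, ' + ', b], 1252);
  WriteLn(StringCodePage(r), ' ', Length(r));
end.
```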

Regards,
Sven