[fpc-pascal] Split stream into words
Michael Van Canneyt
2018-07-03 08:31:43 UTC
Hi,

What's the easiest way to split a stream into words ?
Words are just that: words, but - here is the caveat - they must support unicode.
So Michael and Michaël are both words.

I tried the regexpr unit (the obvious choice), but that does not seem to do the trick:

{$mode objfpc}
{$H+}
uses cwstring, sysutils, classes, regexpr;

Var
  Split : TStringList;
  S : String;
  R : TRegexpr;

begin
  Split:=TStringList.Create;
  Split.LoadFromFile(ParamStr(1));
  S:=Split.Text;
  Split.Clear;
  r := TRegExpr.Create;
  try
    r.Expression :='[\w]+';
    r.Split (S, Split);
    for S in Split do
      Writeln('Found: ',S);
  finally
    r.Free;
  end;
end.

It simply prints nonsense...

Michael.
Michael Van Canneyt
2018-07-03 10:04:32 UTC
Post by Michael Van Canneyt
Hi,
What's the easiest way to split a stream into words ?
Words are just that: words, but - here is the caveat - they must support unicode.
So Michael and Michaël are both words.
Correction, regexp can handle it if you compile for unicode, and use the
correct regexp...

Michael.
Marco van de Voort
2018-07-03 10:05:07 UTC
Post by Michael Van Canneyt
What's the easiest way to split a stream into words ?
Doesn't strutils have some word extraction and count functions?

_______________________________________________
fpc-pascal maillist - fpc-***@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pasca
Michael Van Canneyt
2018-07-03 10:13:36 UTC
Post by Marco van de Voort
Post by Michael Van Canneyt
What's the easiest way to split a stream into words ?
Doesn't strutils have some word extraction and count functions?
It does: WordCount, ExtractWord, but they are very inefficient.

Michael.
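
For reference, a minimal sketch of the strutils route mentioned above. It assumes the stock StdWordDelims set is an acceptable definition of a word boundary, and the sample sentence is made up for illustration. As far as I can tell, each ExtractWord call scans the string again from the start, which is presumably where the cost comes from on large inputs.

```pascal
program wordcountdemo;
{$mode objfpc}{$H+}
uses sysutils, strutils;

const
  // Made-up sample text, not from the thread
  Sample = 'Split a stream, into words.';
var
  I: Integer;
begin
  // ExtractWord is 1-based; every call locates word number I again
  for I := 1 to WordCount(Sample, StdWordDelims) do
    Writeln('Found: ', ExtractWord(I, Sample, StdWordDelims));
end.
```

The delimiters are a set of single-byte characters, so non-ASCII punctuation cannot be listed as a delimiter this way.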
Marco van de Voort
2018-07-03 10:46:05 UTC
Post by Michael Van Canneyt
Post by Marco van de Voort
Post by Michael Van Canneyt
What's the easiest way to split a stream into words ?
Doesn't strutils have some word extraction and count functions?
It does: WordCount, ExtractWord, but they are very inefficient.
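// Needs strutils (PosEx), sysutils (Trim) and classes (TStringList).
// Splits s on every occurrence of c; the caller frees the returned list.
function splitstring(const s:string;c:char):TStringList;

var i,i2,j : integer;
    x : string;
begin
  result:=TStringlist.create;
  i:=0;                            // position of the previous separator
  repeat
    j:=PosEx(c,s,i+1);             // next separator, 0 if none is left
    i2:=j;
    if i2=0 then i2:=length(s)+1;  // last piece runs to the end of s
    x:=trim(copy(s,i+1,i2-i-1));
    result.add(x);
    i:=j;
  until j=0;
end;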

Afaik I also must have a variant with PosSet somewhere. In another variant
I use a class around an array of string, which keeps a count of valid
entries. This avoids SetLength calls on repeated use. All fairly trivial.
Michael Van Canneyt
2018-07-03 10:50:39 UTC
Post by Marco van de Voort
Post by Michael Van Canneyt
Post by Marco van de Voort
Post by Michael Van Canneyt
What's the easiest way to split a stream into words ?
Doesn't strutils have some word extraction and count functions?
It does: WordCount, ExtractWord, but they are very inefficient.
function splitstring(const s:string;c:char):TStringList;
var i,i2,j : integer;
x : string;
begin
result:=TStringlist.create;
i:=0;
repeat
j:=PosEx(c,s,i+1);
i2:=j;
if i2=0 then i2:=length(s)+1;
x:=trim(copy(s,i+1,i2-i-1));
result.add(x);
i:=j;
until j=0;
end;
Afaik I also must have a variant with posset somewhere. In another variant
I use a class around an array of string, which keeps a count of valid
entries. This avoids setlengths on repeated use. All fairly trivial.
Trivial indeed, until you need more fine-grained control,
e.g. when C needs to be an array of chars that mark word boundaries.

But I managed to solve the problem with regexps...

Michael.
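
To illustrate the point about needing several boundary characters, here is a minimal single-pass sketch that splits on a set of delimiters instead of one char. SplitOnSet is a hypothetical helper written for this illustration, not code from the thread or from the RTL, and the sample input is made up.

```pascal
program splitsetdemo;
{$mode objfpc}{$H+}
uses sysutils, classes;

// Hypothetical helper: appends every run of non-delimiter
// characters in S to Words, scanning S exactly once.
procedure SplitOnSet(const S: string; const Delims: TSysCharSet; Words: TStrings);
var
  I, Start: Integer;
begin
  Start := 0;                       // 0 means "not inside a word"
  for I := 1 to Length(S) do
    if S[I] in Delims then
    begin
      if Start > 0 then
        Words.Add(Copy(S, Start, I - Start));
      Start := 0;
    end
    else if Start = 0 then
      Start := I;
  if Start > 0 then                 // flush a word that ends the string
    Words.Add(Copy(S, Start, Length(S) - Start + 1));
end;

var
  Words: TStringList;
  W: string;
begin
  Words := TStringList.Create;
  try
    SplitOnSet('one, two;three', [' ', ',', ';'], Words);
    for W in Words do
      Writeln('Found: ', W);
  finally
    Words.Free;
  end;
end.
```

Because TSysCharSet holds single-byte characters only, this works on UTF-8 text as long as every delimiter is ASCII. A delimiter outside ASCII cannot be put in the set, which is one reason a Unicode-aware regex is attractive here.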
Marcos Douglas B. Santos
2018-07-03 13:21:38 UTC
On Tue, Jul 3, 2018 at 7:50 AM, Michael Van Canneyt
Post by Michael Van Canneyt
Trivial indeed, till you need more fine-grained control.
e.g. C needs to be an array of chars that mark word boundaries etc.
But I managed to solve the problem with regexps...
How?
Michael Van Canneyt
2018-07-03 13:26:14 UTC
Post by Marcos Douglas B. Santos
On Tue, Jul 3, 2018 at 7:50 AM, Michael Van Canneyt
Post by Michael Van Canneyt
Trivial indeed, till you need more fine-grained control.
e.g. C needs to be an array of chars that mark word boundaries etc.
But I managed to solve the problem with regexps...
How?
I misunderstood how Split works. The regex is the 'word separator' in that
function.

The following correctly gives me all words. The uregexpr unit is the regexpr
unit compiled for Unicode.

Michael.

--------------

{$mode objfpc}
{$H+}
uses cwstring, sysutils, classes, uregexpr;

Var
  Split : TStringList;
  S : String;
  R : TRegexpr;
  E : TEncoding;

begin
  Split:=TStringList.Create;
  E:=TEncoding.UTF8;
  Split.LoadFromFile(ParamStr(1),E);
  S:=Split.Text;
  r := TRegExpr.Create;
  try
    // Add punctuation to the characters that \s matches
    r.spaceChars:=r.spaceChars+'|&@#"''(§^!{})-[]*%`=+/.;:,?';
    r.LineSeparators:=#10;
    // Match runs of characters that are neither digits nor \s characters
    r.Expression :='(\b[^\d\s]+\b)';
    if R.Exec(S) then
      repeat
        Writeln('Found: ',System.Copy (S, R.MatchPos [0], R.MatchLen[0]));
      until not R.ExecNext;
  finally
    r.Free;
    Split.Free;
  end;
end.
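
For comparison, a minimal sketch of Split used the way it is meant to be used, with the expression describing the separators rather than the words. It uses the stock regexpr unit, so \W is assumed to cover ASCII word characters only; the sample input is made up and deliberately ASCII.

```pascal
program regexsplitdemo;
{$mode objfpc}{$H+}
uses sysutils, classes, regexpr;

var
  Pieces: TStringList;
  R: TRegExpr;
  P: string;
begin
  Pieces := TStringList.Create;
  R := TRegExpr.Create;
  try
    // The expression matches the separators: runs of non-word characters
    R.Expression := '\W+';
    R.Split('one, two;three', Pieces);
    for P in Pieces do
      if P <> '' then              // skip empty pieces at the string edges
        Writeln('Found: ', P);
  finally
    R.Free;
    Pieces.Free;
  end;
end.
```

With '[\w]+' as the expression, as in the first attempt, the pieces returned are the text between the words, which would explain the nonsense output.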
Marcos Douglas B. Santos
2018-07-03 13:42:53 UTC
On Tue, Jul 3, 2018 at 10:26 AM, Michael Van Canneyt
Post by Michael Van Canneyt
Post by Marcos Douglas B. Santos
On Tue, Jul 3, 2018 at 7:50 AM, Michael Van Canneyt
Post by Michael Van Canneyt
Trivial indeed, till you need more fine-grained control.
e.g. C needs to be an array of chars that mark word boundaries etc.
But I managed to solve the problem with regexps...
How?
I misunderstood how Split works. The regex is the 'word separator' in that
function.
The following correctly gives me all words. The uregexpr unit is the regexpr
unit compiled for Unicode.
Thanks.
But, is uregexp part of FPC?

Marcos Douglas
Michael Van Canneyt
2018-07-03 14:55:25 UTC
Post by Marcos Douglas B. Santos
On Tue, Jul 3, 2018 at 10:26 AM, Michael Van Canneyt
Post by Michael Van Canneyt
Post by Marcos Douglas B. Santos
On Tue, Jul 3, 2018 at 7:50 AM, Michael Van Canneyt
Post by Michael Van Canneyt
Trivial indeed, till you need more fine-grained control.
e.g. C needs to be an array of chars that mark word boundaries etc.
But I managed to solve the problem with regexps...
How?
I misunderstood how Split works. The regex is the 'word separator' in that
function.
The following correctly gives me all words. The uregexpr unit is the regexpr
unit compiled for Unicode.
Thanks.
But, is uregexp part of FPC?
Not yet, but I intend to make it so.

Michael.
Marcos Douglas B. Santos
2018-07-03 17:26:50 UTC
On Tue, Jul 3, 2018 at 11:55 AM, Michael Van Canneyt
Post by Michael Van Canneyt
Post by Marcos Douglas B. Santos
Thanks.
But, is uregexp part of FPC?
Not yet, but I intend to make it so.
All right! Thanks.

Marcos Douglas

PS. Please, don't forget the XPath Unicode implementation too.
We have talked about it months ago... but I can imagine that your
to-do list is huge.
