a few bugs in the "truncate" plugin

 
Spuerhund
Smarty Rookie


Joined: 20 Jan 2005
Posts: 16

Posted: Mon May 29, 2006 9:37 pm    Post subject: a few bugs in the "truncate" plugin

This is the complete code of the modifier.truncate.php function (copied from the latest Smarty version 2.6.14):

Code:
function smarty_modifier_truncate($string, $length = 80, $etc = '...',
                                  $break_words = false, $middle = false)
{
    if ($length == 0)
        return '';

    if (strlen($string) > $length) {
        $length -= strlen($etc);
        if (!$break_words && !$middle) {
            $string = preg_replace('/\s+?(\S+)?$/', '', substr($string, 0, $length+1));
        }
        if(!$middle) {
            return substr($string, 0, $length).$etc;
        } else {
            return substr($string, 0, $length/2) . $etc . substr($string, -$length/2);
        }
    } else {
        return $string;
    }
}

There are several things to improve; some people (including me) would rather call them "bugs":

1) It should be mentioned somewhere that this plugin does NOT work with unicode text! In the manual http://smarty.php.net/manual/en/language.modifier.truncate.php there is no such hint. If you want this code to work correctly with unicode (UTF-8, for example) you have to change "strlen" to "mb_strlen" and "substr" to "mb_substr". Finally you should add the "u" modifier to the preg_replace regexp.

The mb_* functions will of course only work if the PHP mbstring extension is available. That can easily be checked with

Code:
if (extension_loaded('mbstring')) { ... }
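
A minimal sketch of what point 1 describes, assuming UTF-8 input. The name mb_truncate_sketch and the $charset parameter are made up for illustration and are not part of Smarty; the fallback of returning the string unchanged when mbstring is missing is also just an assumption.

```php
<?php
// Hypothetical helper, not part of Smarty: the quoted truncate logic with
// strlen/substr swapped for mb_strlen/mb_substr and the "u" regex modifier.
function mb_truncate_sketch($string, $length = 80, $etc = '...',
                            $break_words = false, $middle = false,
                            $charset = 'UTF-8')
{
    if ($length == 0) {
        return '';
    }
    if (!extension_loaded('mbstring')) {
        // Assumption: without mbstring we cannot count characters safely,
        // so return the input untouched instead of cutting mid-character.
        return $string;
    }
    if (mb_strlen($string, $charset) <= $length) {
        return $string;
    }
    // Reserve room for the marker, counted in characters, never below zero.
    $length = max(0, $length - mb_strlen($etc, $charset));
    if (!$break_words && !$middle) {
        // "u" makes PCRE treat pattern and subject as UTF-8.
        $cut = preg_replace('/\s+?(\S+)?$/u', '',
                            mb_substr($string, 0, $length + 1, $charset));
        if ($cut !== null) { // null means the subject was not valid UTF-8
            $string = $cut;
        }
    }
    if (!$middle) {
        return mb_substr($string, 0, $length, $charset) . $etc;
    }
    $half = (int) floor($length / 2);
    // A start offset of -0 would return the whole string, so guard it.
    $tail = $half > 0 ? mb_substr($string, -$half, null, $charset) : '';
    return mb_substr($string, 0, $half, $charset) . $etc . $tail;
}
```

With mbstring loaded, mb_truncate_sketch('Grüße aus München und Köln', 15) should give 'Grüße aus...', because the limit is counted in characters rather than bytes.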



2) The regular expression /\s+?(\S+)?$/ is "suboptimal". First "\s+?" can be written as "\s", it will not change anything. Second "(\S+)" gets extracted without ever using it, thus it should be changed to "(?:\S+)" to save memory.
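
One way to check the claim in point 2 is to run both patterns over the same inputs. The helper name strip_last_word_both is made up for this sketch. As far as I can tell, the two patterns agree when words are separated by single spaces but not when the cut ends in a run of whitespace.

```php
<?php
// Pattern used in modifier.truncate.php and the simplification proposed above.
$original = '/\s+?(\S+)?$/';
$proposed = '/\s(?:\S+)?$/';

// Hypothetical helper: apply both patterns to the same input.
function strip_last_word_both($s, $original, $proposed)
{
    return array(
        preg_replace($original, '', $s),
        preg_replace($proposed, '', $s),
    );
}

foreach (array('foo bar ba', 'foo  ba', 'foo bar  ') as $s) {
    list($a, $b) = strip_last_word_both($s, $original, $proposed);
    printf("%-13s original=%-10s proposed=%s\n",
        var_export($s, true), var_export($a, true), var_export($b, true));
}
```

For 'foo bar ba' both give 'foo bar'. For 'foo  ba' (two spaces) the original gives 'foo' and the proposed pattern gives 'foo ' with a trailing space; for 'foo bar  ' the results are 'foo bar' and 'foo bar '. So the rewrite is only a drop-in replacement if the result is trimmed afterwards or the input never contains consecutive whitespace.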

3) The value of the $etc variable gets appended to the string whenever it is longer than $length characters. But this is not always correct, as there should be a single whitespace between the last word and the value of $etc whenever words are not cut (break_words = false). There is a difference in the meaning of "foo bar..." and "foo bar ...", therefore depending on the value of $break_words there should be a whitespace.
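
A rough sketch of the behaviour point 3 asks for. append_etc_sketch is a hypothetical helper, not Smarty's code, and it only covers the final concatenation step.

```php
<?php
// Hypothetical helper illustrating the proposal; not Smarty's behaviour.
function append_etc_sketch($cut, $etc = '...', $break_words = false)
{
    if ($break_words) {
        // A word was cut: attach the marker directly to the fragment.
        return $cut . $etc;
    }
    // Whole words were kept: separate the marker with a single space.
    return rtrim($cut) . ' ' . $etc;
}

echo append_etc_sketch('foo bar'), "\n";              // whole words kept
echo append_etc_sketch('foo ba', '...', true), "\n";  // word was cut
```

Note that the extra space costs one character, so a real patch would also have to subtract it from $length when $break_words is false.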

4) The default value of $etc is '...' -- bad too. I refer to http://de.wikipedia.org/wiki/Auslassungspunkte which gives a good explanation of this issue. I simply recommend changing the default value to "…" (unicode: &#8230;).


That's all for the moment; I'll keep on searching. ;-)

Spuerhund
boots
Administrator


Joined: 16 Apr 2003
Posts: 5611
Location: Toronto, Canada

Posted: Mon May 29, 2006 11:43 pm    Post subject: Re: a few bugs in the "truncate" plugin

Hi and thanks for taking the time to write.

Spuerhund wrote:
1) It should be mentioned somewhere that this plugin does NOT work with unicode text!
You do know that this is true for all of PHP, yes? If you want mb-type functions, you have to implement them yourself.

Spuerhund wrote:
2) The regular expression /\s+?(\S+)?$/ is "suboptimal". First "\s+?" can be written as "\s", it will not change anything. Second "(\S+)" gets extracted without ever using it, thus it should be changed to "(?:\S+)" to save memory.
"\s+?" != "\s" -- at best, it is the same as \s* (however, it is treated differently by the regex matching)

Spuerhund wrote:
3) The value of the $etc variable gets appended to the string whenever it is longer than $length characters. But this is not always correct, as there should be a single whitespace between the last word and the value of $etc whenever words are not cut (break_words = false). There is a difference in the meaning of "foo bar..." and "foo bar ...", therefore depending on the value of $break_words there should be a whitespace.
I don't really care one way or the other but I disagree with your assessment. They are both continuations and that is all.

Spuerhund wrote:
4) The default value of $etc is '...' -- bad too. I refer to http://de.wikipedia.org/wiki/Auslassungspunkte which gives a good explanation about this issue. I just recommend to change the default value to "…" (unicode: &#8230;).
Sorry, I don't speak German, but we don't put unicode text into the default plugins.

I'm open to being shown the error-of-my-ways but my position is that if you want unicode behaviour, you have to build it yourself. When PHP officially supports unicode out-of-the-box, so will we.

Best Regards.
bugmenot
Smarty n00b


Joined: 31 May 2006
Posts: 1

Posted: Wed May 31, 2006 10:17 am    Post subject:

"\s+?" does the same as "\s" (at least in this case, because it's not limited at the left).

? does not only mean {0,1}, in cases where a quantifier like + or * is followed by ? it switches the greediness.
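
The two meanings of "?" can be seen side by side in a small sketch (the sample strings are arbitrary):

```php
<?php
$subject = "a   b"; // three spaces between the letters

preg_match('/\s+/', $subject, $greedy);  // "+" takes as much as it can
preg_match('/\s+?/', $subject, $lazy);   // "+?" takes as little as it can
preg_match('/\s?/', 'x', $optional);     // plain "?" means zero or one

var_dump($greedy[0], $lazy[0], $optional[0]);
```

Here the greedy match is all three spaces, the lazy match is a single space, and the optional match is the empty string at the start of 'x'.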

And why do you need Unicode support in PHP if you can just write the numeric entity? That could be used even in ASCII. I think it actually should be written as an entity to keep it (almost) independent of the used charset.
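
One thing to keep in mind with either form: the quoted function reserves room for the marker with strlen($etc), which counts bytes. A small sketch of what that means for an entity versus the literal character (html_entity_decode is used here only to obtain the UTF-8 character without relying on the file encoding):

```php
<?php
$entity = '&#8230;';
// Decode the numeric entity into the UTF-8 ellipsis character (U+2026).
$char = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');

// strlen() counts bytes, and truncate does "$length -= strlen($etc)".
printf("entity: %d bytes, character: %d bytes\n", strlen($entity), strlen($char));
```

The entity costs 7 of the allowed characters and the UTF-8 character 3, although both render as a single glyph, so the visible result would end up shorter than requested in either case.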
boots
Administrator


Joined: 16 Apr 2003
Posts: 5611
Location: Toronto, Canada

Posted: Wed May 31, 2006 3:46 pm    Post subject:

Okay, I'll concede the \s+? issue. It really doesn't seem pertinent enough to make a change, though.

As for using an entity, I think that is not the way to go. Now I'll refer to Wikipedia:
from the Chicago Manual of Style wrote:
Q. How do I insert an ellipsis in my manuscript? My computer keyboard can do that with a couple of keystrokes. Is this acceptable? Or should I type period + space for all three dots? Should these spaces be nonbreaking spaces?

A. For manuscripts, inserting an ellipsis character is a workable method, but it is not the preferred method. It is easy enough for a publisher to search for this unique character and replace it with the recommended three periods plus two nonbreaking spaces (. . .). But in addition to this extra step, there is also the potential for character-mapping problems (the ellipsis could appear as some other character) across software platforms—an added inconvenience. Moreover, the numeric entity for an ellipsis is not formally defined for standard HTML (and may not work with older browsers). So type three spaced dots, like this . . . or, at the end of a grammatical sentence, like this. . . . If you can, add two nonbreaking spaces to keep the three dots—or the last three of four—from breaking across a line.


http://en.wikipedia.org/wiki/Ellipsis

Now there are obviously a lot of rules that can (or may) be applied here. Truncate seems to be working for me as it is. From my POV, I'd rather see a patch that offered a much fuller implementation of the style rules rather than just a few nitpicks that don't really change anything (e.g. the regexes) or that prefer one non-canonical style guideline over another without any particular reason. And yes, entities are problematic. Sometimes people prefer plain-text output rather than HTML.

Finally, please register a real account rather than using a bugmenot account when posting here. Thanks.
White Tiger
Smarty n00b


Joined: 11 Sep 2008
Posts: 2

Posted: Thu Sep 11, 2008 7:31 am    Post subject:

I ran into the same problem with truncate and UNICODE. I'm aware that this topic is 2 years old, and I've found at least 3 others that discuss this question. I would like to reflect on your 'when PHP supports UNICODE out-of-the-box' remark.

- I write my PHP in UTF-8 (the editor codes all .php files in UNICODE). works fine.
- I use MySQL for data storage completely in UTF-8. works perfect.
- I display my HTML in UTF-8 (using <head><meta http-equiv="content-type" content="text/html; charset=utf-8"> </head>). works superb.

Now I do not have to care about string formats anymore: read something from MySQL into a PHP variable and send it to the HTML. I think this is quite a common configuration nowadays.

You are right that for UTF-8 string manipulation in PHP I have to use mb_ functions, but then what is truncate for? The very basis of Smarty's philosophy is to take over ALL of the formatting problems. If I have to do it in PHP then the whole idea is screwed up: I am not able to get rid of formatting problems in PHP.

I am not demanding anything or blaming anyone, and I am very thankful to Smarty for much help in my development. But this problem is just disastrous for my multilingual project in the long run. I would very much appreciate official mb_ versions of the string manipulation plugins. Until then I will use one of the patches offered here in the forum.

And you really _have to_ add a remark to the documentation that your string manipulation is not UNICODE-ready. It is nothing to be ashamed of, just useful information for all developers.
Notromda
Smarty Rookie


Joined: 30 Aug 2004
Posts: 13

Posted: Wed Nov 26, 2008 9:09 pm    Post subject: Be consistent

If the smarty plugins don't support UTF8 because php doesn't, then let's be consistent and offer official _mb_ functions that do the equivalent, just as PHP does.

I found this to be extremely useful, a mb_truncate filter:

http://www.guyrutenberg.com/2007/12/04/multibyte-string-truncate-modifier-for-smarty-mb_truncate/

While I'd rather have smarty know what encoding is in use and automatically do the right thing, I can live with alternate function names. But the option at least needs to be there for everyone to use.
nothinghood
Smarty Rookie


Joined: 20 Aug 2009
Posts: 5

Posted: Fri Aug 21, 2009 7:08 am    Post subject:

The
Code:
return mb_substr($string, 0, $length/2, $charset) . $etc . mb_substr($string, -$length/2, $charset);


part does not work as you expect. In particular, in the second call to mb_substr, $charset is passed as the length.

As stated here
http://us3.php.net/manual/en/function.mb-substr.php#77515

this way works
Code:
return mb_substr($string, 0, $length/2, $charset) . $etc . mb_substr($string, -$length/2, mb_strlen($string), $charset);
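
An alternative sketch for the same fix, assuming the mbstring extension and PHP 5.4.8 or later, where passing null as the length means "to the end of the string". mb_middle_sketch is a hypothetical name, and the integer rounding is an addition of mine so that odd lengths do not pass a fractional offset.

```php
<?php
// Hypothetical helper: middle truncation with an explicit null length
// instead of mb_strlen($string) as the third argument.
function mb_middle_sketch($string, $length, $etc = '...', $charset = 'UTF-8')
{
    // mb_substr expects integer offsets, so round the half down explicitly.
    $half = (int) floor($length / 2);
    if ($half < 1) {
        // A start offset of -0 would return the whole string; keep only the marker.
        return $etc;
    }
    return mb_substr($string, 0, $half, $charset)
         . $etc
         . mb_substr($string, -$half, null, $charset);
}
```

For example, mb_middle_sketch('äöüäöüäöüä', 4) keeps two characters on each side and should return 'äö...üä'.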


bye
nh