Author |
Message |
jrbeaure Smarty n00b
Joined: 23 May 2014 Posts: 4
|
Posted: Tue Dec 09, 2014 6:22 pm Post subject: Regarding smarty_mb_str_replace |
|
|
What are the character encodings that Smarty should work with?
1. The str_replace function covers single-byte character encodings as well as UTF-8, and possibly the other UTF encodings, although only UTF-8 is ASCII-compatible. (UTF-8 takes advantage of the fact that ASCII is 7-bit in order to maintain backwards compatibility in this regard.)
2. An example of a multi-byte character encoding that str_replace would not always work with is Shift-JIS. However, I can't find a concrete example, especially since I'm not Japanese.
3. The smarty_mb_str_replace function is using mb_split, which relies on the character encoding specified by mb_regex_encoding. Given that, I don't see how Smarty's mb_str_replace implementation is encoding-aware.
Why not just use vanilla str_replace? |
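A commonly cited concrete case for point 2 is the Shift-JIS character ソ, which is encoded as the two bytes 0x83 0x5C; the second byte is identical to an ASCII backslash. A minimal sketch, assuming only core PHP (the variable names are illustrative), of how a byte-wise str_replace corrupts it:

```php
<?php
// Shift-JIS bytes for the katakana "ソ": lead byte 0x83, trail byte 0x5C.
// 0x5C is also the ASCII backslash.
$sjisSo = "\x83\x5C";

// str_replace compares raw bytes, so it "finds" a backslash inside the character.
$result = str_replace("\\", "/", $sjisSo);

// The trail byte has been rewritten; the two bytes no longer encode "ソ".
echo bin2hex($result), "\n"; // 832f
```

This kind of false match cannot happen in UTF-8, where the bytes of a multi-byte character are never in the ASCII range.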
|
|
AnrDaemon Administrator
Joined: 03 Dec 2012 Posts: 1785
|
Posted: Tue Dec 09, 2014 11:44 pm Post subject: |
|
|
Smarty has been built with UTF-8 in mind. |
|
|
U.Tews Administrator
Joined: 22 Nov 2006 Posts: 5068 Location: Hamburg / Germany
|
Posted: Wed Dec 10, 2014 12:54 am Post subject: |
|
|
You can set Smarty's charset
Code: | Smarty::$_CHARSET = 'UTF-8'; |
which is the default. |
|
|
jrbeaure Smarty n00b
Joined: 23 May 2014 Posts: 4
|
Posted: Thu Dec 11, 2014 11:50 pm Post subject: |
|
|
Okay, so regarding str_replace compatibility with UTF-8:
One-byte characters are encoded as 0xxxxxxx. This is backwards-compatible with ASCII.
Two-byte characters are encoded as 110xxxxx 10xxxxxx. The byte sequence does not overlap with one-byte characters.
Three-byte characters are encoded as 1110xxxx 10xxxxxx 10xxxxxx. The byte sequence does not overlap with either one-byte or two-byte characters (and so on).
In terms of UTF-8, this makes the function smarty_mb_str_replace pretty useless when the code could be using the faster str_replace function instead. As long as the developer isn't trying to mix character sets, there is no issue in using vanilla str_replace for UTF-8 support.
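The non-overlap argument above can be checked mechanically. The following is a minimal sketch, assuming core PHP 7 or later only; utf8_byte_kind is a hypothetical helper, not part of PHP or Smarty, and it ignores the bytes 0xF8-0xFF, which never occur in valid UTF-8.

```php
<?php
// Classify a single byte by its high bits, following the patterns listed above.
function utf8_byte_kind(int $byte): string
{
    if ($byte < 0x80) {
        return 'ascii';        // 0xxxxxxx
    }
    if ($byte < 0xC0) {
        return 'continuation'; // 10xxxxxx
    }
    return 'lead';             // 110xxxxx, 1110xxxx, 11110xxx
}

$text = "caf\xC3\xA9 \xE2\x82\xAC5"; // "café €5" written as explicit UTF-8 bytes

// unpack() returns a 1-indexed array of byte values; reindex it from 0.
$kinds = array_map('utf8_byte_kind', array_values(unpack('C*', $text)));

// A byte-wise replace of the two-byte sequence for "é" can only match at a
// character boundary, because a needle starts with a lead or ASCII byte and
// those never appear in the middle of another character.
$replaced = str_replace("\xC3\xA9", "e", $text);
```

Here $kinds[3] and $kinds[4] (the two bytes of "é") come out as a lead byte followed by a continuation byte, and the euro sign after it is left untouched by the replacement.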
Additionally, modifiers that include a reference to smarty_mb_str_replace don't have a compiler version. That means that for every reference to the replace modifier, at least one system call has to be made (for the include/require statement), and in some environments system calls can be expensive (which is what made me look into this in the first place).
Edit: Basically, what I'm trying to say here is that I'd like to see smarty_mb_str_replace removed from the codebase and compiler versions created for modifiers such as the replace modifier. |
|
|
U.Tews Administrator
Joined: 22 Nov 2006 Posts: 5068 Location: Hamburg / Germany
|
|
|
jrbeaure Smarty n00b
Joined: 23 May 2014 Posts: 4
|
Posted: Fri Dec 12, 2014 6:23 pm Post subject: |
|
|
Ensuring the same character encoding throughout (i.e. validating input) is important (otherwise the code is inherently flawed), and there's a function for that which helps prevent invalid-encoding attacks such as the one described in the link you provided.
http://php.net/manual/en/function.mb-check-encoding.php
Are you sure that smarty_mb_str_replace is the right solution? Is there a reason not to validate input when variables are being assigned to the template? I'm pretty sure that's been the general library solution for preventing SQL injection attacks (except that in those cases they're "prepared statements" and not "templates"). |
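The validate-at-assignment idea could look roughly like this. It is only a sketch: it assumes the mbstring extension is loaded, and require_valid_utf8 is a hypothetical helper that is not part of Smarty.

```php
<?php
// Reject any string that is not well-formed UTF-8 before it goes any further.
function require_valid_utf8(string $value): string
{
    if (!mb_check_encoding($value, 'UTF-8')) {
        throw new InvalidArgumentException('Value is not valid UTF-8');
    }
    return $value;
}

// Well-formed input ("café") passes through unchanged.
$ok = require_valid_utf8("caf\xC3\xA9");

// 0xC3 announces a two-byte character, but 0x28 is not a continuation byte.
try {
    require_valid_utf8("\xC3\x28");
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
```

In a Smarty application, a check like this would sit in front of the calls to $smarty->assign(), so that later byte-wise string functions only ever see valid UTF-8.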
|
|
AnrDaemon Administrator
Joined: 03 Dec 2012 Posts: 1785
|
Posted: Fri Dec 12, 2014 10:58 pm Post subject: |
|
|
The whole set of mb_* functions is one-sided and inherently flawed. I don't know how you can call them "safer". The sole reason everyone's using them is because there seems to be no alternative.
From my experience, there's just no one reliable set of functions to work with all the diverse encodings. You always need something from that other lib over there; in the end, your code turns into an ugly pile of crap. |
|
|
jrbeaure Smarty n00b
Joined: 23 May 2014 Posts: 4
|
Posted: Tue Dec 16, 2014 8:17 pm Post subject: |
|
|
AnrDaemon wrote: | The whole set of mb_* functions is one-sided and inherently flawed. I don't know how you can call them "safer". The sole reason everyone's using them is because there seems to be no alternative.
From my experience, there's just no one reliable set of functions to work with all the diverse encodings. You always need something from that other lib over there; in the end, your code turns into an ugly pile of crap. |
They work from a functionality standpoint. E.g. mb_strpos provides you character position rather than byte position, mb_convert_case works on all bicameral alphabetic characters (which include Latin, Greek, and Cyrillic alphabets), etc.
In this instance I'm talking about a performance enhancement. The counter-argument is a safety issue (invalid encoding attack). The counter-counter-argument is that mb_check_encoding can be used at a higher level to prevent that attack outright rather than it leaking into other functionality. |
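The functional difference mentioned above, sketched with core PHP plus the mbstring extension (assumed to be loaded); the sample strings are illustrative:

```php
<?php
$text = "caf\xC3\xA9 bar"; // "café bar" written as explicit UTF-8 bytes

// strpos counts bytes: "é" occupies two of them, so "bar" starts at byte 6.
$bytePos = strpos($text, 'bar');

// mb_strpos counts characters: "é" is one character, so "bar" starts at 5.
$charPos = mb_strpos($text, 'bar', 0, 'UTF-8');

// Case conversion beyond ASCII: Greek "αβ" becomes "ΑΒ".
$upper = mb_convert_case("\xCE\xB1\xCE\xB2", MB_CASE_UPPER, 'UTF-8');
```

Neither difference matters for a plain search-and-replace on valid UTF-8, which is why the performance argument is limited to str_replace.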
|
|
AnrDaemon Administrator
Joined: 03 Dec 2012 Posts: 1785
|
Posted: Tue Dec 16, 2014 9:01 pm Post subject: |
|
|
mb_check_encoding can be used to ensure data safety. Just don't make the mistake of using mb_detect_encoding. THAT one is pure evil. |
|
|
|