Bug 205

Summary: Can't open banks with non-ascii characters in path
Product: jlscp
Component: jlscp
Status: ASSIGNED
Severity: major
Priority: P5
Version: SVN Trunk
Hardware: PC
OS: Linux
Reporter: Nikita Zlobin <cook60020tmp>
Assignee: Grigor Iliev <gr.iliev>
CC: cuse
Attachments: jlscp-fix-utf8-escaping.diff

Description Nikita Zlobin 2013-12-25 13:45:14 CET
This problem appears after a bank is selected. In the file chooser, all Russian names are displayed correctly. But when the engine tries to open the file, the Russian characters have been replaced with question marks.

Fortunately, I store all banks and even samples in ~/sounds instead of some standard desktop path, which is usually translated to the user's locale.
Comment 1 Christian Schoenebeck 2021-05-12 13:54:41 CEST
Would you please try if it works for you when using QSampler instead of JSampler/Fantasia?
Comment 2 Nikita Zlobin 2021-05-13 00:57:52 CEST
For now I have linuxsampler-2.1.0.svn8, qsampler 9.1 and jsampler 0.9.
I have to note that both qsampler and jsampler fail to load by filename. But when I tried in the lscp shell, it was able to load a file with Cyrillic in its name. However, the command GET CHANNEL INFO 0 displays non-ASCII characters as escape codes rather than UTF-8 text.

I tried to rebuild the linuxsampler stuff from fresh sources (I use live ebuilds in Gentoo), but my gcc is old by now (7.4.0), and the build says that at least C++14 is required. I will see if I can update qsampler without updating linuxsampler.
Comment 3 Nikita Zlobin 2021-05-13 09:04:47 CEST
More recent info.
It's not just about passing the instrument file path, but about all non-ASCII communication with linuxsampler. In qsampler it's visible in the channel creation dialog, where device names containing non-ASCII characters (all Russian) are displayed as \xXX.

It's possible to pass UTF-8 both directly (typed or pasted) and by replacing non-ASCII bytes with C-like escape sequences (still in quotes, of course).

For example.
gig file: "/home/nick87720z/Музыка/Yamaha C7.gig"

lscp shell input:

CREATE AUDIO_OUTPUT_DEVICE JACK
CREATE MIDI_INPUT_DEVICE ALSA
ADD CHANNEL
LOAD INSTRUMENT NON_MODAL "/home/nick87720z/Музыка/Yamaha C7.gig" 0 0

Now:
lscp=# GET CHANNEL INFO 0
ENGINE_NAME: GIG
VOLUME: 1.000
AUDIO_OUTPUT_DEVICE: 0
AUDIO_OUTPUT_CHANNELS: 2
AUDIO_OUTPUT_ROUTING: 0,1
MIDI_INPUT_DEVICE: 0
MIDI_INPUT_PORT: 0
MIDI_INPUT_CHANNEL: 0
INSTRUMENT_FILE: /home/nick87720z/\xd0\x9c\xd1\x83\xd0\xb7\xd1\x8b\xd0\xba\xd0\xb0/Yamaha\x20C7.gig
INSTRUMENT_NR: 0
INSTRUMENT_NAME: A\x27 Yamaha C7  \x2716 (Up+ Rel)
INSTRUMENT_STATUS: 100
MUTE: false
SOLO: false
MIDI_INSTRUMENT_MAP: NONE

This sequence is correct.
The following instrument command works as well:
LOAD INSTRUMENT NON_MODAL "/home/nick87720z/\xd0\x9c\xd1\x83\xd0\xb7\xd1\x8b\xd0\xba\xd0\xb0/Yamaha\x20C7.gig" 0 0
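
As a rough illustration of how the escaped form above can be produced, here is a minimal Java sketch. It assumes the client first converts the path to UTF-8 bytes and then writes every non-printable or non-ASCII byte as \xHH. The class and method names are hypothetical and not part of jlscp or liblscp; unlike the server output above, the sketch leaves spaces unescaped, which the quoted command form accepts.

```java
import java.nio.charset.StandardCharsets;

public class LscpEscapeSketch {
    // Hypothetical helper: hex-escape the UTF-8 bytes of a string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean special = v == '\\' || v == '"' || v == '\'';
            if (special || v < 0x20 || v > 0x7E) {
                sb.append(String.format("\\x%02x", v));
            } else {
                sb.append((char) v);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The \u escapes spell the Cyrillic directory name from the example path.
        String dir = "\u041c\u0443\u0437\u044b\u043a\u0430";
        System.out.println(escape("/home/nick87720z/" + dir + "/Yamaha C7.gig"));
    }
}
```

The directory part of the printed path matches the byte sequence in the INSTRUMENT_FILE line above.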

Now about the frontends. Here is what linuxsampler logs right after an attempt to load the instrument.
JSampler:

Scheduling '/home/nick87720z/C7K:0/Yamaha C7.gig' (Index=0) to be loaded in background (if not loaded yet).
Loading gig file '/home/nick87720z/C7K:0/Yamaha C7.gig'...gig::Engine error: Failed to load instrument, cause: Can't open "/home/nick87720z/C7K:0/Yamaha C7.gig": No such file or directory

QSampler:

Scheduling '/home/nick87720z/' (Index=0) to be loaded in background (if not loaded yet).
Loading gig file '/home/nick87720z/'...gig::Engine error: Failed to load instrument, cause: Not a RIFF file
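
A side note on the JSampler log above: the garbled directory name C7K:0 is exactly what results if each UTF-16 code unit of the Cyrillic name is truncated to its low byte. This is only a guess at the cause, not something verified in the JSampler or jlscp sources. The sketch below just reproduces the arithmetic; truncateToLowBytes is a hypothetical helper.

```java
public class LowByteSketch {
    // Hypothetical helper: keep only the low 8 bits of every UTF-16 code unit.
    static String truncateToLowBytes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append((char) (s.charAt(i) & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The \u escapes spell the Cyrillic directory name from the example path.
        String dir = "\u041c\u0443\u0437\u044b\u043a\u0430";
        String cut = truncateToLowBytes(dir);
        // The first resulting character is the control code 0x1C, which a
        // terminal does not show; the remaining five are printable ASCII.
        System.out.println(cut.substring(1));
    }
}
```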

I wonder: is it really not possible to make LSCP communication work with different encodings? UTF-8 now seems to be the standard, at least for terminals. Moreover, it is ASCII-compatible. If making it the default might break existing clients, there could still be an LSCP command to set the encoding.

A faster way would be to change the GUI to interpret \xXX sequences in both directions.
Comment 4 Nikita Zlobin 2021-05-13 09:09:49 CEST
Also about lscp shell behavior when pasting Russian text.
UTF-8 characters are displayed correctly by the terminal, but the cursor shifts by multiple positions, apparently counting bytes. And backspace erases single bytes instead of entire UTF-8 sequences.
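
To put numbers on the byte-versus-character mismatch, here is a small Java sketch (names are hypothetical, and it says nothing about how the lscp shell is actually implemented). It compares the UTF-8 byte length of the Cyrillic directory name with its number of code points.

```java
import java.nio.charset.StandardCharsets;

public class Utf8WidthSketch {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points, i.e. what the user sees as characters here.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // The \u escapes spell the Cyrillic directory name from the example path.
        String dir = "\u041c\u0443\u0437\u044b\u043a\u0430";
        System.out.println(utf8Bytes(dir) + " bytes, " + codePoints(dir) + " code points");
        // prints "12 bytes, 6 code points"
    }
}
```

A line editor that moves the cursor per byte would therefore step twice for every Cyrillic letter.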
Comment 5 Nikita Zlobin 2021-05-13 09:23:07 CEST
The last attempt was with fresh qsampler and liblscp from git.
Comment 6 Nikita Zlobin 2021-05-13 19:00:24 CEST
I guess I'm ready to try to fix it. At first glance, it is already supposed to support escaping and to pass UTF-8 to LS.
Comment 7 Nikita Zlobin 2021-05-14 08:41:42 CEST
Although the problem is wider, this is enough to fix loading:
https://gitlab.com/rncbc/qsampler/-/merge_requests/1
Comment 8 Nikita Zlobin 2021-05-14 19:17:35 CEST
It seems linuxsampler accepts UTF-8 without problems. The test with the lscp shell is not just a single example. Qsampler code uses lscpEscapePath() in two liblscp calls, lscp_load_instrument_non_modal() and lscp_map_midi_instrument(), right inside the argument list. Although only the first case is ours here, I edited both as follows (from git diff output):

diff --git a/src/qsamplerChannel.cpp b/src/qsamplerChannel.cpp
index 1a5c8bf..49326e3 100644
--- a/src/qsamplerChannel.cpp
+++ b/src/qsamplerChannel.cpp
@@ -224,8 +224,7 @@ bool Channel::loadInstrument ( const QString& sInstrumentFile, int iInstrumentNr
 
 	if (::lscp_load_instrument_non_modal(
 			pMainForm->client(),
-			qsamplerUtilities::lscpEscapePath(
-				sInstrumentFile).toUtf8().constData(),
+			sInstrumentFile.toUtf8().constData(),
 			iInstrumentNr, m_iChannelID
 		) != LSCP_OK) {
 		appendMessagesClient("lscp_load_instrument");
diff --git a/src/qsamplerInstrument.cpp b/src/qsamplerInstrument.cpp
index 7beb109..c0bfb21 100644
--- a/src/qsamplerInstrument.cpp
+++ b/src/qsamplerInstrument.cpp
@@ -196,8 +196,8 @@ bool Instrument::mapInstrument (void)
 
 	if (::lscp_map_midi_instrument(pMainForm->client(), &instr,
 			m_sEngineName.toUtf8().constData(),
-			qsamplerUtilities::lscpEscapePath(
-				m_sInstrumentFile).toUtf8().constData(),
+			
+			m_sInstrumentFile.toUtf8().constData(),
 			m_iInstrumentNr, m_fVolume, load_mode,
 			m_sName.toUtf8().constData()) != LSCP_OK) {
 		pMainForm->appendMessagesClient("lscp_map_midi_instrument");

Note: when the lscpEscapePath() call is in place, toUtf8() could be replaced by e.g. toLatin1() without breakage (I guess). But even without escaping, it just works. Tracing down into the liblscp code, I found no further conversions before the send() call (sys/socket.h). The escaping requirement comes from LSCP 1.2, as stated in the comments in the lscpEscapePath() code.
Comment 9 Nikita Zlobin 2021-05-16 09:41:20 CEST
I forgot this bug was about jsampler, not qsampler. Instead of fixing jsampler, I was working on a fix for qsampler. The merge request above now fixes all qsampler communication; it is just waiting for rncbc's attention.
Comment 10 Nikita Zlobin 2021-05-16 16:25:31 CEST
I'm trying to analyse jsampler and jlscp to find where (un)escaping happens. For now it seems to be defined in jlscp (Parser.java). I still can't find how it translates binary data to LSCP escapes. My first suggestion: UTF-16 is the standard string encoding in Java. It should be converted to the system encoding (or UTF-8, which is what Linux uses now) _before_ any attempt to escape. In Qt, the QString::toUtf8() method produces a QByteArray object. It could work the same way in Java; I'm still digging (this is my first attempt to dig into a Java project).
Comment 11 Nikita Zlobin 2021-05-16 16:45:44 CEST
I see a number of .getBytes("US-ASCII") calls in both jsampler and jlscp. It may be really good to call getBytes() without an argument before escaping.
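
For reference, a minimal Java sketch of what encoding to US-ASCII does to Cyrillic text: String.getBytes() substitutes '?' for every character the charset cannot represent, which matches the question marks from the original description. The class and method names are hypothetical. Note also that getBytes() without an argument uses the platform default charset, so passing StandardCharsets.UTF_8 explicitly is the more predictable choice.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AsciiLossSketch {
    // Encode a string with the given charset and decode it again.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // The \u escapes spell the Cyrillic directory name from the example path.
        String dir = "\u041c\u0443\u0437\u044b\u043a\u0430";
        System.out.println(roundTrip(dir, StandardCharsets.US_ASCII));        // prints "??????"
        System.out.println(roundTrip(dir, StandardCharsets.UTF_8).equals(dir)); // prints "true"
    }
}
```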
Comment 12 Christian Schoenebeck 2021-11-26 16:51:36 CET
I see you have been working on a fix for QSampler on the same issue (bug #314), thanks!

Do you have plans for a fix for this issue on Fantasia/JSampler as well? If not, I will still leave this report open for some time, but will eventually close it as WONTFIX, as nobody has been actively working on Fantasia/JSampler for many years now.
Comment 13 Nikita Zlobin 2021-11-26 17:00:49 CET
I really tried, but I don't have enough experience coding in Java, as I have never really coded in it. I only have a little from university lessons. All I understood at first glance is that it's different from qsampler. I guess I could try at some point in the future unless someone else fixes it. For now, that's the most precise thing I can tell.
Comment 14 Christian Schoenebeck 2021-12-01 15:46:48 CET
Ok, I understand, Nikita.

For anyone that might be interested in looking at this issue: Java does not seem to have built-in translation of escape sequences (e.g. in a convenient way with String.getBytes(...)). So probably the best way to handle this in JSampler/Fantasia would be to use a regular expression to translate the incoming data from the sampler, replacing all occurrences of escape sequences with the respective Unicode characters.
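
A minimal sketch of the regular-expression approach for the incoming direction, under some assumptions: it handles only \xHH escapes (not octal or \n style ones, and it does not guard against an escaped backslash directly followed by "x"), and the class and method names are hypothetical, not jlscp API. Consecutive escapes are collected into one byte array and decoded together as UTF-8, because a single Cyrillic letter spans two escapes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LscpUnescapeSketch {
    // One or more consecutive \xHH escapes.
    private static final Pattern HEX_RUN =
        Pattern.compile("(?:\\\\x[0-9a-fA-F]{2})+");

    static String unescapeHex(String s) {
        Matcher m = HEX_RUN.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start()); // copy the text before the run
            String run = m.group();
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            for (int i = 0; i < run.length(); i += 4) { // each escape is 4 chars
                bytes.write(Integer.parseInt(run.substring(i + 2, i + 4), 16));
            }
            out.append(new String(bytes.toByteArray(), StandardCharsets.UTF_8));
            last = m.end();
        }
        out.append(s, last, s.length()); // copy the remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        String wire = "/home/nick87720z/\\xd0\\x9c\\xd1\\x83\\xd0\\xb7"
                    + "\\xd1\\x8b\\xd0\\xba\\xd0\\xb0/Yamaha\\x20C7.gig";
        System.out.println(unescapeHex(wire));
    }
}
```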

Then for the other way around, i.e. converting from a Unicode string from JSampler/Fantasia to be sent out to the sampler: maybe just converting the Java String object into a Character object array, which can deal with unicode for each character:

String s = ... ;
...
Character[] charObjectArray = ArrayUtils.toObject(s.toCharArray());

and finally assembling a Java string from that array where each unicode character is replaced by an escape sequence instead.
Comment 15 Nikita Zlobin 2021-12-02 12:10:01 CET
Tried once more; this time it was much simpler, I just used the Eclipse IDE.
The problem is really no different from the one in qsampler.
The Java String class is locked to UTF-16 by design, and StringBuffer seems to use the same encoding, the only difference being that it is mutable.

This confusion seems to be common: I found an article which assumes that a String object can be recreated in a different encoding via an intermediate byte[] array (according to the docs, the charset argument in String(byte[], charset) specifies the encoding of the incoming bytes, while a String itself can't be anything other than UTF-16).
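
A small Java sketch of that point, with hypothetical names: the charset passed to the String constructor only tells Java how to interpret the incoming bytes. Decoding UTF-8 bytes as UTF-8 restores the original text, while decoding the same bytes as ISO-8859-1 yields one char per byte.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeCharsetSketch {
    // Interpret raw bytes using the given charset; the result is always
    // an ordinary (UTF-16 based) Java String.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // The \u escapes spell the Cyrillic directory name from the example path.
        String dir = "\u041c\u0443\u0437\u044b\u043a\u0430";
        byte[] wire = dir.getBytes(StandardCharsets.UTF_8);

        String right = decode(wire, StandardCharsets.UTF_8);
        String wrong = decode(wire, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(dir)); // prints "true"
        System.out.println(wrong.length());    // prints "12"
    }
}
```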

For now I got input parsing (from the server) working; here is the svn diff output:
(the real conversion code is in jlscp, not jsampler itself)
=================================================
Index: src/org/linuxsampler/lscp/Parser.java
===================================================================
--- src/org/linuxsampler/lscp/Parser.java	(revision 3905)
+++ src/org/linuxsampler/lscp/Parser.java	(working copy)
@@ -617,7 +617,8 @@
 	public static String
 	toNonEscapedString(Object obj) {
 		String s = obj.toString();
-		StringBuffer sb = new StringBuffer();
+		byte[] sb = new byte[s.length() + 1];
+		int j = 0;
 		for(int i = 0; i < s.length(); i++) {
 			char c = s.charAt(i);
 			if(c == '\\') {
@@ -626,34 +627,34 @@
 					break;
 				}
 				char c2 = s.charAt(++i);
-				if(c2 == '\'')      sb.append('\'');
-				else if(c2 == '"')  sb.append('"');
-				else if(c2 == '\\') sb.append('\\');
-				else if(c2 == 'r')  sb.append('\r');
-				else if(c2 == 'n')  sb.append('\n');
-				else if(c2 == 'f')  sb.append('\f');
-				else if(c2 == 't')  sb.append('\t');
-				else if(c2 == 'v')  sb.append((char)0x0B);
+				if     (c2 == '\'') sb[j++] = '\''; 
+				else if(c2 == '"')  sb[j++] = '"';
+				else if(c2 == '\\') sb[j++] = '\\';
+				else if(c2 == 'r')  sb[j++] = '\r';
+				else if(c2 == 'n')  sb[j++] = '\n';
+				else if(c2 == 'f')  sb[j++] = '\f';
+				else if(c2 == 't')  sb[j++] = '\t';
+				else if(c2 == 'v')  sb[j++] = (char)0x0B;
 				else if(c2 == 'x') {
-					Character ch = getHexEscapeSequence(s, i + 1);
-					if(ch != null) sb.append(ch.charValue());
+					byte ch = getHexEscapeSequence(s, i + 1);
+					if(ch != 0) sb[j++] = ch;
 					i += 2;
 				} else if(c2 >= '0' && c2 <= '9') {
-					Character ch = getOctEscapeSequence(s, i);
-					if(ch != null) sb.append(ch.charValue());
+					byte ch = getOctEscapeSequence(s, i);
+					if(ch != 0) sb[j++] = ch;
 					i += 2;
 				} else Client.getLogger().info("Unknown escape sequence \\" + c2);
 			} else {
-				sb.append(c);
+				sb[j++] = (byte)c;
 			}
 		}
-		
-		return sb.toString();
+		// Decode only the j bytes written; the rest of the buffer is unused.
+		return new String(sb, 0, j, java.nio.charset.StandardCharsets.UTF_8);
 	}
 	
-	private static Character
+	private static byte
 	getHexEscapeSequence(String s, int index) {
-		Character c = null;
+		byte c = 0;
 		
 		if(index + 1 >= s.length()) {
 			Client.getLogger().info("Broken escape sequence");
@@ -660,15 +661,15 @@
 			return c;
 		}
 		
-		try { c = (char)Integer.parseInt(s.substring(index, index + 2), 16); }
+		try { c = (byte)Integer.parseInt(s.substring(index, index + 2), 16); }
 		catch(Exception x) { Client.getLogger().info("Broken escape sequence!"); }
 		
 		return c;
 	}
 	
-	private static Character
+	private static byte
 	getOctEscapeSequence(String s, int index) {
-		Character c = null;
+		byte c = 0;
 		
 		if(index + 2 >= s.length()) {
 			Client.getLogger().info("Broken escape sequence");
@@ -675,7 +676,7 @@
 			return c;
 		}
 		
-		try { c = (char)Integer.parseInt(s.substring(index, index + 3), 8); }
+		try { c = (byte)Integer.parseInt(s.substring(index, index + 3), 8); }
 		catch(Exception x) { Client.getLogger().info("Broken escape sequence!"); }
 		
 		return c;
Comment 16 Nikita Zlobin 2021-12-02 15:19:51 CET
Created attachment 102
jlscp-fix-utf8-escaping.diff

I forgot that patches are better attached.
Here is a more complete one, for both the input and output streams.
For me this fixes the drum maps list, the recent filenames list in instrument settings, and sending them to linuxsampler (I hope there's no duplicate code).
Marked as patch, but it may lack some specific info, since I generated it with svn diff from a working copy.
Comment 17 Nikita Zlobin 2021-12-14 12:44:52 CET
Setting the product to jlscp seems more logical, but jsampler is still affected, because it uses jlscp.jar from its own sources. For the change to take effect, the patched jlscp.jar must be copied into the jsampler sources before building.