Converting character sets in text

After running into a problem while checking out old files from a CVS repository, I found out how to display the byte values of the offending characters and how to convert them.
Most older filenames were encoded in ISO-8859-1 (Latin-1, Western European), ISO-8859-2 (Latin-2, Central European) or ISO-8859-15 (Latin-9, Western European plus the euro sign). Most newer systems use UTF-8.

An example illustrates this best.
Here is the filename I got from the old CVS repository:
ls *.jpg
Architektur�bersicht_Gesamt�berblick.jpg

Dumping the filename byte by byte shows what it actually contains:

ls *.jpg | hexdump -cb
0000000   A   r   c   h   i   t   e   k   t   u   r   �   b   e   r   s
0000000 101 162 143 150 151 164 145 153 164 165 162 374 142 145 162 163
0000010   i   c   h   t   _   G   e   s   a   m   t   �   b   e   r   b
0000010 151 143 150 164 137 107 145 163 141 155 164 374 142 145 162 142
0000020   l   i   c   k   .   j   p   g  \n
0000020 154 151 143 153 056 152 160 147 012
0000029

We can see that the odd ‘�’ characters have the octal value ‘374’ (hex FC; the -b option of hexdump prints octal), which is the German ‘ü’ encoded in ISO-8859-1. To display it correctly on a system that uses a UTF-8 locale, you can pass it through the character set converter ‘recode’. Here is what I get:
ls *.jpg | recode ISO-8859-1..UTF-8
Architekturübersicht_Gesamtüberblick.jpg
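
If ‘recode’ is not at hand, ‘iconv’ (shipped with glibc on most Linux systems) can do the same conversion. The following is only a minimal sketch under that assumption: it builds the Latin-1 bytes with printf instead of reading a real filename, converts them, and dumps the result in hex so the change is visible.

```shell
# Build the Latin-1 bytes for "über": octal 374 is the single byte 0xFC.
latin1=$(printf '\374ber')

# Convert ISO-8859-1 to UTF-8; the one-byte "ü" becomes the two bytes C3 BC.
utf8=$(printf '%s' "$latin1" | iconv -f ISO-8859-1 -t UTF-8)

# Dump the converted bytes in hex to confirm the result.
printf '%s' "$utf8" | od -An -tx1
```

The hex dump should show c3 bc in place of the original fc, followed by the unchanged ASCII bytes for "ber".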

Note: On a Debian Linux system, the ‘recode’ tool can be installed with the command:
apt-get install recode
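
The commands above only convert the listing, not the files on disk. As a rough sketch, the same conversion could be used to rename the files themselves. The function name fix_names is made up for this illustration, ‘iconv’ is used here in place of ‘recode’, and the loop assumes that every *.jpg name in the current directory really is ISO-8859-1; a name that is already UTF-8 would get mangled, so try it on a copy first.

```shell
# Hypothetical helper: rename each *.jpg in the current directory whose
# name changes when converted from ISO-8859-1 to UTF-8.
fix_names() {
    for old in *.jpg; do
        # No match leaves the literal pattern in $old; skip it.
        [ -e "$old" ] || continue
        # Convert the name; skip the file if iconv reports an error.
        new=$(printf '%s' "$old" | iconv -f ISO-8859-1 -t UTF-8) || continue
        # Pure ASCII names come out unchanged and need no rename.
        if [ "$old" != "$new" ]; then
            mv -- "$old" "$new"
        fi
    done
}
```

Run fix_names inside the checked-out directory, then list the files again to check that the names display correctly.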

IT Tips and Tricks
