Looking for sites with various encodings
Une Bévue

Since Ruby will drop iconv in the future, I will have to use
String#encode for conversions.
I'd like to test it on pages with known encodings, converting them to UTF-8.
Do you know of any sites with known encodings?
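For reference, here is a minimal sketch of the kind of conversion String#encode performs when the source encoding is already known. The `to_utf8` helper name and the sample bytes are illustrative assumptions, not taken from any script in this thread.

```ruby
# Hypothetical helper: transcode raw bytes from a known encoding to UTF-8.
def to_utf8(bytes, source_encoding)
  # String#encode(dst, src) reads the receiver as src and transcodes it to dst.
  bytes.encode("UTF-8", source_encoding)
end

# "café" as ISO-8859-1 bytes; pack("C*") yields a binary (ASCII-8BIT) string.
latin1 = [0x63, 0x61, 0x66, 0xE9].pack("C*")
utf8   = to_utf8(latin1, "ISO-8859-1")

puts utf8            # café
puts utf8.encoding   # UTF-8
```

Passing the source encoding explicitly avoids depending on whatever encoding the string happened to be tagged with when it was read.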
http://www.user.uni-hannover.de/nhtcapri/arabic.iso.html
http://www.user.uni-hannover.de/nhtcapri/arabic.win.html
http://www.user.uni-hannover.de/nhtcapri/baltic.iso.html
http://www.user.uni-hannover.de/nhtcapri/baltic.win.html
http://www.user.uni-hannover.de/nhtcapri/central-european.iso.html
http://www.user.uni-hannover.de/nhtcapri/central-european.mac.html
http://www.user.uni-hannover.de/nhtcapri/central-european.win.html
http://www.user.uni-hannover.de/nhtcapri/cyrillic.iso.html
http://www.user.uni-hannover.de/nhtcapri/cyrillic.koi.html
http://www.user.uni-hannover.de/nhtcapri/cyrillic.mac.html
http://www.user.uni-hannover.de/nhtcapri/cyrillic.win.html
http://www.user.uni-hannover.de/nhtcapri/greek.iso.html
http://www.user.uni-hannover.de/nhtcapri/greek.mac.html
http://www.user.uni-hannover.de/nhtcapri/greek.win.html
http://www.user.uni-hannover.de/nhtcapri/hebrew.iso.html
http://www.user.uni-hannover.de/nhtcapri/hebrew.win.html
http://www.user.uni-hannover.de/nhtcapri/south-european.iso.html
http://www.user.uni-hannover.de/nhtcapri/turkish.iso.html
http://www.user.uni-hannover.de/nhtcapri/turkish.mac.html
http://www.user.uni-hannover.de/nhtcapri/vietnamese.win.html
Great, thanks a lot!
No luck: according to lynx, all of these sites are declared as UTF-8.
I'm looking for sites in ISO 8859 or some Windows-style encoding so that
I can convert them to UTF-8.
Well, your Lynx is wrong.
Firefox, for example, reports the correct encodings.
Maybe your Lynx behaves like my ELinks? You have to tell it the page's
encoding "by hand".
Indeed, I just checked one of the sites with Chrome on Xubuntu 12.04,
and it really is in ISO 8859 Arabic...
On the Dell / Xubuntu 12.04 I have:
.-(~/lib/ruby1.9)----------------------------------------------------()-
`--> lynx --version
Lynx Version 2.8.8dev.9 (12 Jun 2011)
libwww-FM 2.14, SSL-MM 1.4.1, GNUTLS 2.10.5, ncurses 5.9.20110404(wide)
Compilé le linux-gnu Jun 22 2011 09:52:55
Copyrights : Lynx Developers Group,
l'université du Kansas, le CERN, et d'autres contributeurs.
Distribué sous licence GNU General Public License (version 2).
Voyez http://lynx.isc.org/ et d'aide en ligne pour plus d'information.
and on the iMac / Mac OS X Lion I have:
.-(~/lib/ruby1.9)--------------------------------------------------()-
`--> ssh iMac '/opt/local/bin/lynx --version'
Lynx Version 2.8.7rel.1 (05 Jul 2009)
libwww-FM 2.14, SSL-MM 1.4.1, OpenSSL 1.0.1c, ncurses 5.9.20110404
Built on darwin11.1.0 Aug 21 2011 08:15:51
Copyrights held by the Lynx Developers Group,
the University of Kansas, CERN, and other contributors.
Distributed under the GNU General Public License (Version 2).
See http://lynx.isc.org/ and the online help for more information.
See http://www.openssl.org/ for information about OpenSSL.
I suspect I'll have to check all of this.
Anyway, all I do with lynx is a simple:
$ lynx -head -dump <url>
as indicated on the pages:
Une Bévue
And a response like this one (for the URL
http://www.user.uni-hannover.de/nhtcapri/arabic.iso.html):
HTTP/1.1 200 OK
Date: Sun, 24 Jun 2012 19:44:09 GMT
Server: Apache/2.2.16 (Debian)
Last-Modified: Wed, 04 Apr 2012 12:00:00 GMT
ETag: "10349e04-2da9-4bcd929ad7000"
Accept-Ranges: bytes
Content-Length: 11689
Vary: Accept-Encoding
Connection: close
Content-Type: text/html
does not say that the document is in UTF-8.
Since the header sent by the server does not specify an encoding, the
(HTML) document is free to declare whichever one it wants...
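To make the point concrete, here is a small sketch of how a script might read the charset parameter from a Content-Type header value. The `charset_from_header` name is hypothetical; it returns nil when the server declares nothing, as in the response quoted above.

```ruby
# Hypothetical helper: pull the charset parameter out of a Content-Type value.
# Returns nil when no charset parameter is present.
def charset_from_header(content_type)
  return nil if content_type.nil?
  match = content_type.match(/;\s*charset\s*=\s*"?([^\s;"]+)"?/i)
  match && match[1].downcase
end

p charset_from_header("text/html")                      # nil
p charset_from_header("text/html; charset=ISO-8859-6")  # "iso-8859-6"
```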
--
Paul Gaborit -
Yes, and the document does declare:
<meta http-equiv="Content-Type" content="charset=iso-8859-6">
When the server specifies nothing, I therefore have to rely on the document.
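A rough sketch of that fallback, assuming the page bytes are already in memory. `charset_from_meta` and `detect_charset` are hypothetical names, and the regexp is a simplification rather than a full HTML parser.

```ruby
# Hypothetical helper: find a charset declared in a <meta> tag.
def charset_from_meta(html)
  # Work on raw bytes so non-UTF-8 content cannot trip the regexp engine.
  bytes = html.dup.force_encoding("ASCII-8BIT")
  match = bytes.match(/<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9._:-]+)/i)
  match && match[1].downcase
end

# Prefer the charset from the HTTP header; fall back to the document itself.
def detect_charset(header_charset, html)
  header_charset || charset_from_meta(html)
end

page = %(<html><head><meta http-equiv="Content-Type" content="charset=iso-8859-6"></head>)
puts detect_charset(nil, page)   # iso-8859-6
```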
Now, for those pages, I do get:
$ tests/string_encode_test.rb
url = http://www.user.uni-hannover.de/nhtcapri/arabic.iso.html
charset = iso-8859-6
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/arabic.win.html
charset = windows-1256
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/baltic.iso.html
charset = iso-8859-13
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/baltic.win.html
charset = windows-1257
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/central-european.iso.html
charset = iso-8859-2
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/central-european.mac.html
charset = x-mac-ce
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/central-european.win.html
charset = windows-1250
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/cyrillic.iso.html
charset = iso-8859-5
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/cyrillic.koi.html
charset = iso-ir-111
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/cyrillic.mac.html
charset = x-mac-cyrillic
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/cyrillic.win.html
charset = windows-1251
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/greek.iso.html
charset = iso-8859-7
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/greek.mac.html
charset = x-mac-greek
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/greek.win.html
charset = windows-1253
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/hebrew.iso.html
charset = iso-8859-8-i
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/hebrew.win.html
charset = windows-1255
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/south-european.iso.html
charset = iso-8859-3
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/turkish.iso.html
charset = iso-8859-9
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/turkish.mac.html
charset = x-mac-turkish
content_type = text/html
url = http://www.user.uni-hannover.de/nhtcapri/vietnamese.win.html
charset = windows-1258
content_type = text/html
All that remains is the conversion to UTF-8...
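One way that remaining step might look, as a sketch only: transcode the raw bytes with String#encode, then update the charset named in the <meta> tag so the saved copy describes itself correctly. `page_to_utf8` is a hypothetical name, and the sample input is built by hand rather than fetched.

```ruby
# Hypothetical helper: transcode a page to UTF-8 and update its <meta> charset.
def page_to_utf8(bytes, charset)
  text = bytes.dup.force_encoding("ASCII-8BIT").encode("UTF-8", charset)
  # Replace only the declared charset value, leaving the rest of the tag alone.
  text.sub(/(<meta[^>]*charset\s*=\s*["']?)[A-Za-z0-9._:-]+/i) { "#{$1}UTF-8" }
end

meta = %(<meta http-equiv="Content-Type" content="charset=iso-8859-6">)
# 0xC7 is the letter alef in ISO-8859-6.
raw  = meta.dup.force_encoding("ASCII-8BIT") + [0xC7].pack("C")
out  = page_to_utf8(raw, "iso-8859-6")

puts out.encoding   # UTF-8
```

String#encode raises Encoding::ConverterNotFoundError when Ruby has no converter for the given name, so a caller would probably want to rescue that.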
I don't understand what you are trying to do. To get useful help, you
should be more specific!
I'm looking (was looking) for sites whose encoding is not UTF-8, in order to
detect the encoding and, once it is known, convert to UTF-8.
It's to test one or more Ruby scripts that I have to update because of
upcoming changes in how Ruby handles transcoding.
Thanks to the URLs provided by "Andreas Prilop" earlier in the thread,
I managed to put together a small test that works on almost all of the
sites provided.
For some of them Ruby does not recognize the encoding; preliminary results:
url = http://www.user.uni-hannover.de/nhtcapri/arabic.iso.html
lynx_charset = iso-8859-6
file_orig = /home/yt/Sites/String_encode/arabic.iso.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
file_utf8 = /home/yt/Sites/String_encode/arabic.iso.to_utf8.html
url = http://www.user.uni-hannover.de/nhtcapri/arabic.win.html
lynx_charset = windows-1256
file_orig = /home/yt/Sites/String_encode/arabic.win.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
file_utf8 = /home/yt/Sites/String_encode/arabic.win.to_utf8.html
url = http://www.user.uni-hannover.de/nhtcapri/baltic.iso.html
lynx_charset = iso-8859-13
file_orig = /home/yt/Sites/String_encode/baltic.iso.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
file_utf8 = /home/yt/Sites/String_encode/baltic.iso.to_utf8.html
url = http://www.user.uni-hannover.de/nhtcapri/baltic.win.html
lynx_charset = windows-1257
file_orig = /home/yt/Sites/String_encode/baltic.win.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
file_utf8 = /home/yt/Sites/String_encode/baltic.win.to_utf8.html
url = http://www.user.uni-hannover.de/nhtcapri/central-european.iso.html
lynx_charset = iso-8859-2
file_orig = /home/yt/Sites/String_encode/central-european.iso.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
file_utf8 = /home/yt/Sites/String_encode/central-european.iso.to_utf8.html
url = http://www.user.uni-hannover.de/nhtcapri/central-european.mac.html
lynx_charset = x-mac-ce
file_orig = /home/yt/Sites/String_encode/central-european.mac.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
Encoding::ConverterNotFoundError
url = http://www.user.uni-hannover.de/nhtcapri/central-european.win.html
lynx_charset = windows-1250
file_orig = /home/yt/Sites/String_encode/central-european.win.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
file_utf8 = /home/yt/Sites/String_encode/central-european.win.to_utf8.html
url = http://www.user.uni-hannover.de/nhtcapri/cyrillic.iso.html
lynx_charset = iso-8859-5
file_orig = /home/yt/Sites/String_encode/cyrillic.iso.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
file_utf8 = /home/yt/Sites/String_encode/cyrillic.iso.to_utf8.html
url = http://www.user.uni-hannover.de/nhtcapri/cyrillic.koi.html
lynx_charset = iso-ir-111
file_orig = /home/yt/Sites/String_encode/cyrillic.koi.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
Encoding::ConverterNotFoundError
url = http://www.user.uni-hannover.de/nhtcapri/cyrillic.mac.html
lynx_charset = x-mac-cyrillic
file_orig = /home/yt/Sites/String_encode/cyrillic.mac.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
Encoding::ConverterNotFoundError
url = http://www.user.uni-hannover.de/nhtcapri/cyrillic.win.html
lynx_charset = windows-1251
file_orig = /home/yt/Sites/String_encode/cyrillic.win.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
file_utf8 = /home/yt/Sites/String_encode/cyrillic.win.to_utf8.html
url = http://www.user.uni-hannover.de/nhtcapri/greek.iso.html
lynx_charset = iso-8859-7
file_orig = /home/yt/Sites/String_encode/greek.iso.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
file_utf8 = /home/yt/Sites/String_encode/greek.iso.to_utf8.html
url = http://www.user.uni-hannover.de/nhtcapri/greek.mac.html
lynx_charset = x-mac-greek
file_orig = /home/yt/Sites/String_encode/greek.mac.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
Encoding::ConverterNotFoundError
url = http://www.user.uni-hannover.de/nhtcapri/greek.win.html
lynx_charset = windows-1253
file_orig = /home/yt/Sites/String_encode/greek.win.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
file_utf8 = /home/yt/Sites/String_encode/greek.win.to_utf8.html
url = http://www.user.uni-hannover.de/nhtcapri/hebrew.iso.html
lynx_charset = iso-8859-8-i
file_orig = /home/yt/Sites/String_encode/hebrew.iso.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
Encoding::ConverterNotFoundError
url = http://www.user.uni-hannover.de/nhtcapri/hebrew.win.html
lynx_charset = windows-1255
file_orig = /home/yt/Sites/String_encode/hebrew.win.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
file_utf8 = /home/yt/Sites/String_encode/hebrew.win.to_utf8.html
url = http://www.user.uni-hannover.de/nhtcapri/south-european.iso.html
lynx_charset = iso-8859-3
file_orig = /home/yt/Sites/String_encode/south-european.iso.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
file_utf8 = /home/yt/Sites/String_encode/south-european.iso.to_utf8.html
url = http://www.user.uni-hannover.de/nhtcapri/turkish.iso.html
lynx_charset = iso-8859-9
file_orig = /home/yt/Sites/String_encode/turkish.iso.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
file_utf8 = /home/yt/Sites/String_encode/turkish.iso.to_utf8.html
url = http://www.user.uni-hannover.de/nhtcapri/turkish.mac.html
lynx_charset = x-mac-turkish
file_orig = /home/yt/Sites/String_encode/turkish.mac.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
Encoding::ConverterNotFoundError
url = http://www.user.uni-hannover.de/nhtcapri/vietnamese.win.html
lynx_charset = windows-1258
file_orig = /home/yt/Sites/String_encode/vietnamese.win.html
line = <meta http-equiv="Content-Type" content="charset=UTF-8">
Encoding::ConverterNotFoundError
The ones marked "Encoding::ConverterNotFoundError" were not
transcoded, but that may just be a matter of the name the header or the
HTML file gives the encoding, which might not be "standardized"...
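If it is indeed a naming problem, one thing to try is mapping the labels found in the pages to names Ruby might know, then asking Encoding::Converter whether a converter to UTF-8 exists. This is a sketch under assumptions: the alias table below is a guess to check against Encoding.name_list on the Ruby in use, and having a name listed there does not guarantee that a converter exists. No mapping is suggested for iso-ir-111 or windows-1258.

```ruby
# Hypothetical alias table: labels seen in the pages mapped to names Ruby may know.
# The right-hand names are assumptions; verify them with Encoding.name_list.
CHARSET_ALIASES = {
  "iso-8859-8-i"   => "ISO-8859-8",   # same byte mapping, different text direction handling
  "x-mac-cyrillic" => "macCyrillic",
  "x-mac-greek"    => "macGreek",
  "x-mac-turkish"  => "macTurkish",
  "x-mac-ce"       => "macCentEuro",
}

# Translate a page label to the name handed to Ruby; unknown labels pass through.
def ruby_encoding_name(label)
  CHARSET_ALIASES.fetch(label.downcase, label)
end

# True when this Ruby build can transcode from the label to UTF-8.
def convertible_to_utf8?(label)
  Encoding::Converter.new(ruby_encoding_name(label), "UTF-8")
  true
rescue Encoding::ConverterNotFoundError, ArgumentError
  false
end

%w[iso-8859-6 x-mac-ce iso-ir-111].each do |label|
  puts "#{label}: #{convertible_to_utf8?(label)}"
end
```

Running the check before calling String#encode makes it possible to list the unsupported labels up front instead of finding them through exceptions.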