Importing CSV files with non-standard characters

MDAustin · October 12, 2009, 10:08pm

As part of my learning I am making a name generator for my game, which to all intents and purposes is working until I enter non-standard characters.

Here is my test.csv file:

"Bergström",0.2
"Kwŏn",0.2
"Muñoz",0.2
"Huáng",0.2
"Itō",0.2

and here is what I am using to import it:

probsurnames = [(surname,float(prob)) for surname,prob in csv.reader(open('test.csv'))]

Saving the .csv as Unicode returns this:

File "trial.py", line 18, in <module>
probsurnames = [(surname,float(prob)) for surname,prob in csv.reader(open('test.csv'))]
_csv.Error: line contains NULL byte
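
A likely explanation for the NULL byte, sketched here under the assumption that the editor's "Unicode" option means UTF-16 (as it does in Windows Notepad): in UTF-16 every ASCII character is stored as two bytes, one of which is zero. The thread uses Python 2; this sketch is written for Python 3 and the sample line is taken from test.csv above.

```python
import codecs

line = '"Bergström",0.2'

# Assumption: the editor's "Unicode" option writes UTF-16 little-endian with a BOM.
utf16_bytes = codecs.BOM_UTF16_LE + line.encode('utf-16-le')
utf8_bytes = line.encode('utf-8')

# Each ASCII character becomes two bytes in UTF-16, the second being zero.
print(utf16_bytes[:8])
print(b'\x00' in utf16_bytes)   # True: NUL bytes are present
print(b'\x00' in utf8_bytes)    # False: UTF-8 has none for this text
```

Reading such a file as plain bytes hands those zero bytes straight to the csv module, which would account for the error above.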

Saving the .csv as UTF-8 returns this as an example:

ï»¿“BergstrÃ¶m”
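
The garbled output can be reproduced as a rough sketch, assuming the file is valid UTF-8 with a leading byte order mark and that whatever displays the bytes interprets them as Windows-1252. Both are assumptions, not something the post confirms, and the sketch is Python 3 rather than the Python 2 used in the thread.

```python
import codecs

# Bytes as a BOM-writing editor would save the first field in UTF-8.
raw = codecs.BOM_UTF8 + '"Bergström"'.encode('utf-8')

# Assumption: the display step interprets the bytes as Windows-1252.
garbled = raw.decode('cp1252')
print(garbled)

# Decoding with the right codec recovers the text; 'utf-8-sig' also drops the BOM.
fixed = raw.decode('utf-8-sig')
print(fixed)
```

If this is what is happening, the file itself is fine and only the decoding step is wrong.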

It’s kind of gutting to get so close on this - I have tried the wrapper from the Python doc 14.1 but can’t get that to work either. Any help is greatly appreciated.

enn0x · October 13, 2009, 6:24am

How are you writing the csv file? Not closing the file after writing is a common mistake, especially if you open a file without specifying the io-mode (you always should).

enn0x

drwr · October 13, 2009, 6:27am

Instead of open('test.csv'), try codecs.open('test.csv', 'r', 'utf-8') (you will need to import codecs).

David

MDAustin · October 13, 2009, 7:30pm

Unfortunately this does not work for me. I have also tried to get the unicode_csv_reader from the Python documentation to work, to no avail. Once I have this cracked I will post the answer in case anyone else has the same problem.

drwr · October 13, 2009, 7:58pm

Well, start by removing csv from the problem. Can you simply read the file in utf-8 mode and get the characters properly? If not, then there’s a problem with the way you’re saving out the file in the first place.
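
A minimal sketch of that check, written for Python 3 (where open() accepts an encoding argument; the thread itself uses Python 2). The helper name dump_lines and the temporary stand-in for test.csv are hypothetical, introduced only for illustration.

```python
import os
import tempfile

def dump_lines(path, encoding='utf-8'):
    """Return the repr of every line so hidden characters become visible."""
    with open(path, encoding=encoding) as handle:
        return [repr(line.rstrip('\n')) for line in handle]

# Hypothetical stand-in for test.csv, written without any BOM.
path = os.path.join(tempfile.mkdtemp(), 'test.csv')
with open(path, 'w', encoding='utf-8') as handle:
    handle.write('"Bergström",0.2\n"Muñoz",0.2\n')

for shown in dump_lines(path):
    print(shown)
```

If the accented characters come back intact here, the file is fine and the problem lies in how csv is being fed; if not, the file was saved in some other encoding.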

Note that Microsoft tools such as Notepad and Wordpad don’t write a plain utf-8 file. They add three hidden bytes (EF BB BF, the byte order mark) to the front of the file to indicate the encoding used, but this is largely a Microsoft convention and Python will not necessarily strip it for you.
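
One way to check for those leading bytes is to look at the raw data before any decoding. This is a Python 3 sketch; the helper name starts_with_utf8_bom is hypothetical, while codecs.BOM_UTF8 is a constant from the standard library.

```python
import codecs

def starts_with_utf8_bom(data):
    """True if the byte string begins with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)

with_bom = codecs.BOM_UTF8 + '"Bergström",0.2\n'.encode('utf-8')
without_bom = '"Bergström",0.2\n'.encode('utf-8')

print(len(codecs.BOM_UTF8))               # the mark is three bytes long
print(starts_with_utf8_bom(with_bom))     # True
print(starts_with_utf8_bom(without_bom))  # False
```

For a real file, the same test could be applied to the first few bytes read in binary mode.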

David

MDAustin · October 14, 2009, 11:35am

This was the exact problem. Thank you for guiding me to the fix. Hopefully with a bit more time I can get something a bit more substantial to show.

MDAustin · October 14, 2009, 4:31pm

This is the code that actually did the business for me. As a consequence of the issue highlighted above, though, the first line of the .csv has to be a throwaway line, because the extra bytes at the front of the file end up attached to it. For example, the first line of my .csv reads “Error”, 0.00

I am sure there is a way around this but my brain can’t find it.

import csv

def encoded_csv_reader_to_unicode(encoded_csv_data,
                                  coding='utf-8',
                                  dialect=csv.excel,
                                  **kwargs):
    # Python 2: csv.reader yields byte strings, so each cell is decoded afterwards.
    csv_reader = csv.reader(encoded_csv_data,
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, coding) for cell in row]

# myfile is the path to the UTF-8 encoded .csv
encoded_csv_reader_to_unicode(open(myfile))
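
One possible way around the dummy first line is to let the decoder drop the leading bytes itself, using the standard 'utf-8-sig' codec. The following is only a sketch in Python 3 (not the Python 2 used above); read_surnames and the temporary stand-in for test.csv are hypothetical names for illustration.

```python
import csv
import os
import tempfile

def read_surnames(path):
    """Read (surname, probability) pairs from a UTF-8 CSV, with or without a BOM."""
    # 'utf-8-sig' skips a leading BOM if present; newline='' is what the csv docs ask for.
    with open(path, newline='', encoding='utf-8-sig') as handle:
        return [(surname, float(prob)) for surname, prob in csv.reader(handle)]

# Hypothetical stand-in for test.csv, saved the way a BOM-writing editor would.
path = os.path.join(tempfile.mkdtemp(), 'test.csv')
with open(path, 'w', newline='', encoding='utf-8-sig') as handle:
    handle.write('"Bergström",0.2\r\n"Kwŏn",0.2\r\n"Itō",0.2\r\n')

probsurnames = read_surnames(path)
print(probsurnames)
```

In Python 2, a comparable effect might be had by passing coding='utf-8-sig' to the function above, though that only cleans the first cell if it is unquoted, so treat it as untested.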