probsurnames = [(surname,float(prob)) for surname,prob in csv.reader(open('test.csv'))]
Saving the .csv as Unicode returns this:
File "trial.py", line 18, in <module>
    probsurnames = [(surname,float(prob)) for surname,prob in csv.reader(open('test.csv'))]
_csv.Error: line contains NULL byte
Saving the .csv as UTF-8 returns this as an example:
"Bergström"
It's kind of gutting to get so close on this - I have tried the wrapper from section 14.1 (csv) of the Python docs but can't get that to work either. Any help is greatly appreciated.
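For reference, the "NULL byte" error above is the classic symptom of a UTF-16 file (what Notepad's save dialog calls "Unicode"): for ASCII characters, every other byte is a zero byte. A quick way to check what you actually saved is to sniff the file's leading bytes for a byte-order mark; this is only a sketch, and `sniff_bom` is a made-up helper name:

```python
import codecs

def sniff_bom(raw):
    """Guess an encoding from leading BOM bytes, if any (sketch only)."""
    if raw.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        # Notepad's "Unicode" option: UTF-16, which pads ASCII with NULL bytes
        return 'utf-16'
    return None

# Simulated bytes as Notepad's "Unicode" (UTF-16 LE) save would produce them
sample = codecs.BOM_UTF16_LE + 'Bergström,0.5\n'.encode('utf-16-le')
print(sniff_bom(sample))   # -> utf-16
```

In practice you would call `sniff_bom(open(path, 'rb').read(4))` on the real file before deciding how to decode it.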
How are you writing the csv file? Not closing the file after writing is a common mistake, especially if you open a file without specifying the I/O mode (you always should).
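A minimal sketch of writing the file so it is guaranteed to be flushed and closed, with the mode and encoding stated explicitly (modern Python 3 shown; the filename and rows are stand-ins, not the asker's data):

```python
import csv

rows = [("Bergström", 0.5), ("Smith", 0.25)]  # hypothetical sample data

# Explicit mode, explicit encoding, and newline='' as the csv docs require.
with open('test.csv', 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerows(rows)
# The 'with' block closes the file even if writing raises an exception.
```

Relying on the garbage collector to close the file can leave the last buffer unwritten, which looks a lot like a corrupted csv.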
Unfortunately this does not work for me. I have also tried to get the unicode_csv_reader from the Python documentation to work, to no avail. Once I have this cracked I will post the answer in case anyone else has the same problem.
Well, start by removing csv from the problem. Can you simply read the file in utf-8 mode and get the characters properly? If not, then there’s a problem with the way you’re saving out the file in the first place.
Note that Microsoft tools such as Notepad and WordPad don't write a strictly standard utf-8 file. They add a byte-order mark (BOM, three bytes for UTF-8) to the front of the file that's supposed to indicate the encoding used, but putting one in a UTF-8 file is largely a Microsoft convention, and a plain utf-8 decode in Python will leave those bytes in your data.
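To make the effect concrete, here is a small sketch (the byte string simulates a Notepad-style "UTF-8 with BOM" file) showing that decoding with `'utf-8-sig'` strips the BOM, while plain `'utf-8'` leaves it glued to the first cell:

```python
import codecs
import csv
import io

# Simulate a file saved by Notepad as UTF-8: a BOM precedes the data.
raw = codecs.BOM_UTF8 + 'Bergström,0.5\n'.encode('utf-8')

# 'utf-8-sig' removes the BOM; plain 'utf-8' would leave '\ufeff'
# stuck to the first cell, which is why the first row looks corrupted.
text = raw.decode('utf-8-sig')
rows = list(csv.reader(io.StringIO(text)))
print(rows)   # [['Bergström', '0.5']]
```

With plain `raw.decode('utf-8')` the first cell would come back as `'\ufeffBergström'`, which explains the mangled first line reported above.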
This is the code that actually did the business for me. One wrinkle, related to the issue highlighted above: the first line of the .csv must not be used - for example, the first line of my .csv reads "Error", 0.00,
I am sure there is a way around this but my brain can't find it.
import csv

def encoded_csv_reader_to_unicode(encoded_csv_data,
                                  coding='utf-8',
                                  dialect=csv.excel,
                                  **kwargs):
    # Wrap csv.reader and decode every cell from the given encoding (Python 2).
    csv_reader = csv.reader(encoded_csv_data,
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, coding) for cell in row]

probsurnames = [(surname, float(prob))
                for surname, prob in encoded_csv_reader_to_unicode(open(myfile))]
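For later readers: on Python 3 the csv module works with text natively, so the whole wrapper reduces to opening the file with the right encoding, and `'utf-8-sig'` also removes the BOM that forced the first row to be skipped. A sketch (the file path and sample rows are stand-ins for the question's test.csv):

```python
import csv
import os
import tempfile

# Recreate a small UTF-8-with-BOM file, as "Save as UTF-8" in Notepad does.
path = os.path.join(tempfile.mkdtemp(), 'test.csv')
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    f.write('Bergström,0.5\nSmith,0.25\n')

# With the right encoding, the question's original one-liner just works,
# first row included - no per-cell decoding, no skipped line.
with open(path, encoding='utf-8-sig', newline='') as f:
    probsurnames = [(surname, float(prob)) for surname, prob in csv.reader(f)]
print(probsurnames)   # [('Bergström', 0.5), ('Smith', 0.25)]
```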