How to Fix CSV Encoding Problems: UTF-8, Shift-JIS, and Beyond
You downloaded a CSV, opened it, and instead of readable text you got Ã©, 繧ア繧ォ, or a wall of question marks. That's an encoding mismatch — the file was saved in one character encoding, but your app is reading it in another.
The fix is straightforward once you know which encoding you're dealing with. This guide covers five methods to convert your file (from text editors to Python one-liners), explains the BOM problem that trips up even experienced users, and shows how to stop dealing with encoding issues altogether.
5 Ways to Fix a CSV File with Encoding Problems
Before converting anything, figure out what encoding the file actually uses. Then pick the method that fits your workflow.
Identify the Encoding First (Pattern Cheat Sheet)
Garbled characters aren't random. The specific garbage you see tells you exactly what went wrong.
| What you see | What it means | Actual encoding → Read as |
|---|---|---|
| Ã© instead of é | UTF-8 read as Windows-1252 | UTF-8 → Windows-1252 |
| Ã¼ instead of ü | UTF-8 read as Windows-1252 | UTF-8 → Windows-1252 |
| â€œ instead of " | UTF-8 smart quotes read as Windows-1252 | UTF-8 → Windows-1252 |
| 繧ア繧ォ繧ォ | UTF-8 Japanese read as Shift-JIS | UTF-8 → Shift-JIS |
| ÿþ at file start | UTF-16 Little Endian BOM | UTF-16 LE → Windows-1252 |
|  at file start | UTF-8 BOM read as Windows-1252 | UTF-8 → Windows-1252 |
| ? replacing characters | Unmappable characters lost in an earlier conversion | Various → lossy conversion |
Most text editors show the detected encoding in the status bar. In VS Code, it's in the bottom-right corner. In Notepad++, it's in the bottom bar labeled "UTF-8", "ANSI", etc. If the status bar says "UTF-8" but the text looks wrong, the file probably isn't actually UTF-8 — the editor guessed wrong.
Fix It in a Text Editor — Notepad++, VS Code, or Sublime Text
This is the most common approach and works for any file your editor can open comfortably.
VS Code:
- Open the CSV file
- Look at the encoding label in the bottom-right status bar (it might say "UTF-8" even if that's wrong)
- Click the encoding label → select "Reopen with Encoding"
- Choose the correct source encoding (e.g., "Shift JIS", "Windows 1252")
- The text should now display correctly
- Click the encoding label again → select "Save with Encoding" → choose "UTF-8"
Notepad++:
- Open the CSV file
- Go to Encoding menu → check what's currently selected
- Select "Encode in UTF-8" (or "Encode in UTF-8-BOM" if you need Excel compatibility)
- Save the file
The critical distinction here: "Reopen with Encoding" re-reads the raw bytes using a different decoder — it's diagnostic. "Save with Encoding" re-encodes the current text and writes it — it's the actual conversion. If you "Save with Encoding" on a file that's already displaying garbled text, you'll permanently bake in the corruption. Always reopen with the correct encoding first, confirm the text looks right, then save.
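The same distinction, as a minimal Python sketch: re-decoding the raw bytes is the safe diagnostic step, while re-encoding already-garbled text bakes the damage in.

```python
raw = 'café'.encode('cp1252')                    # the file's actual bytes: b'caf\xe9'

# "Reopen with Encoding": decode the SAME bytes with a different codec
garbled = raw.decode('utf-8', errors='replace')  # wrong guess: 'caf\ufffd'
correct = raw.decode('cp1252')                   # right guess: 'café'

# "Save with Encoding" on garbled text writes the corruption permanently
baked = garbled.encode('utf-8')
assert 'é' not in baked.decode('utf-8')          # the original é is unrecoverable
```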
Fix It in Excel — the Import Wizard Method
Double-clicking a CSV in Excel doesn't let you pick an encoding — Excel just guesses (usually wrong for non-ASCII files). The import wizard gives you control.
- Open Excel with a blank workbook
- Go to Data tab → From Text/CSV (or "Get Data" → "From File" → "From Text/CSV")
- Select your CSV file
- In the preview dialog, find the "File Origin" or "Encoding" dropdown
- Switch between encodings until the preview shows correct text:
- Try 65001: Unicode (UTF-8) first
- If that's garbled, try 932: Japanese (Shift-JIS) or 1252: Western European (Windows)
- Click Load
This method doesn't modify the original file — it just reads it correctly into Excel. If you want to save a properly encoded version, use "Save As" → "CSV UTF-8 (Comma delimited)".
Fix It from the Command Line — iconv and PowerShell
Command-line tools are the best option for batch processing or scripting into a data pipeline.
macOS / Linux (iconv):
# Convert Shift-JIS to UTF-8
iconv -f SHIFT_JIS -t UTF-8 input.csv > output_utf8.csv
# Convert Windows-1252 to UTF-8
iconv -f WINDOWS-1252 -t UTF-8 input.csv > output_utf8.csv
# Detect encoding first
# macOS:
file -I input.csv
# Linux:
file -i input.csv
# Output: input.csv: text/csv; charset=shift_jis
If iconv throws an "illegal input sequence" error, the source file likely contains characters that don't exist in the target encoding (common when converting UTF-8 files that include emoji or symbols to Shift-JIS). Add the -c flag to skip unmappable characters and continue:
iconv -f UTF-8 -c -t SHIFT_JIS input.csv > output_sjis.csv
Characters that are skipped will be silently dropped — always inspect the output before discarding the original.
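Python's codecs behave the same way: `errors='ignore'` is the rough equivalent of iconv's -c flag. A sketch with a made-up sample string:

```python
text = 'price: 100円 😀'   # the emoji has no Shift-JIS mapping

try:
    text.encode('shift_jis')          # strict mode raises on the emoji
except UnicodeEncodeError:
    pass

# errors='ignore' mirrors iconv -c: unmappable characters are silently dropped
lossy = text.encode('shift_jis', errors='ignore')
print(lossy.decode('shift_jis'))      # 'price: 100円 ' (the emoji is gone)
```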
Batch conversion (all CSVs in a folder):
for f in *.csv; do
  iconv -f SHIFT_JIS -t UTF-8 "$f" > "utf8_${f}"
done
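A cross-platform equivalent of the loop above, sketched as a Python helper (the function name batch_to_utf8 is ours, and it assumes every CSV in the folder really is in the source encoding):

```python
from pathlib import Path

def batch_to_utf8(folder, source_encoding='shift_jis'):
    # sorted() snapshots the listing so newly written utf8_* files aren't re-processed
    for src in sorted(Path(folder).glob('*.csv')):
        text = src.read_text(encoding=source_encoding)
        (src.parent / f'utf8_{src.name}').write_text(text, encoding='utf-8')
```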
Windows PowerShell:
# Read with one encoding, write as UTF-8
Get-Content -Path input.csv -Encoding Default |
Set-Content -Path output_utf8.csv -Encoding UTF8
PowerShell encoding caveat: In PowerShell 5.x (the default on Windows 10 and 11), -Encoding UTF8 produces UTF-8 with BOM. This is fine for Excel users, but will break Python scripts and most data pipelines. If you need BOM-free UTF-8, either use PowerShell 7+ with -Encoding utf8NoBOM, or pipe through a .NET method:
# PowerShell 7+: BOM-free UTF-8
Get-Content -Path input.csv -Encoding Default |
Set-Content -Path output_utf8.csv -Encoding utf8NoBOM
# PowerShell 5.x: BOM-free UTF-8 workaround
$content = Get-Content -Path input.csv -Encoding Default
[System.IO.File]::WriteAllLines("output_utf8.csv", $content)
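Whichever route you take, it's worth verifying the result. This hypothetical helper checks the first three bytes of the output for a UTF-8 BOM:

```python
def has_utf8_bom(path):
    # EF BB BF at the very start of the file marks a UTF-8 BOM
    with open(path, 'rb') as f:
        return f.read(3) == b'\xef\xbb\xbf'
```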
If iconv throws an error on Mac/Linux, you probably have the source encoding wrong. Try a different -f value. The file -I / file -i command can help you identify the actual encoding before converting.
Fix It in Python — pandas and codecs
For developers or anyone already working in Python, this is often the fastest path. It handles edge cases better than most GUI tools.
Basic conversion with pandas:
import pandas as pd
# Read with the source encoding
df = pd.read_csv('input.csv', encoding='shift_jis')
# Write as UTF-8
df.to_csv('output_utf8.csv', encoding='utf-8', index=False)
Auto-detect encoding with chardet:
import chardet
import pandas as pd

with open('input.csv', 'rb') as f:
    result = chardet.detect(f.read())

print(result)
# {'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}

df = pd.read_csv('input.csv', encoding=result['encoding'])
df.to_csv('output_utf8.csv', encoding='utf-8', index=False)
Without pandas (pure Python):
import codecs

with codecs.open('input.csv', 'r', encoding='shift_jis') as source:
    with codecs.open('output_utf8.csv', 'w', encoding='utf-8') as target:
        target.write(source.read())
A word of caution on chardet: it's a statistical guesser, not a decoder. It works well on large files with lots of text, but can guess wrong on short files or files with mostly ASCII content. Always inspect the output after conversion.
The BOM Problem — When UTF-8 Isn't Enough
BOM stands for Byte Order Mark. In UTF-8 it's three invisible bytes (EF BB BF) at the very start of a file — a flag that says "this file is UTF-8." And it's the source of an annoying split in the CSV world.
BOM Makes Excel Happy but Breaks Everything Else
Here's the dilemma:
- Excel on Windows needs BOM to auto-detect UTF-8. Without it, Excel falls back to your system's default encoding (often Windows-1252 or Shift-JIS) and garbles the text.
- Python, CLI tools, and most web applications treat BOM as data. You'll see an extra invisible character (`\ufeff`) in the first cell or column name, which breaks column lookups and key matching.
| Scenario | Use BOM? | Why |
|---|---|---|
| CSV for Excel users | Yes | Excel won't auto-detect UTF-8 without it |
| CSV for Python / data pipelines | No | BOM becomes a phantom character in column headers |
| CSV for web app import | No | Most web apps don't expect BOM |
| CSV for Japanese Excel (Shift-JIS required) | N/A | Save as Shift-JIS instead of UTF-8 |
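The phantom-character failure mode is easy to demonstrate with Python's standard library:

```python
import csv
import io

data = '\ufeffid,name\n1,Alice\n'        # UTF-8 BOM left in the decoded text
header = next(csv.reader(io.StringIO(data)))
print(header[0] == 'id')                 # False: the first column is '\ufeffid'

# decoding the raw bytes with 'utf-8-sig' strips the BOM before csv sees it
clean = data.encode('utf-8').decode('utf-8-sig')
header = next(csv.reader(io.StringIO(clean)))
print(header[0] == 'id')                 # True
```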
How to Add or Remove BOM
Add BOM (for Excel compatibility):
- Notepad++: Encoding → "Encode in UTF-8-BOM" → Save
- VS Code: Click encoding → "Save with Encoding" → "UTF-8 with BOM"
- Python: `open('out.csv', 'w', encoding='utf-8-sig')`
- PowerShell: `-Encoding UTF8` in PowerShell 5.x adds BOM automatically
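To see what `utf-8-sig` actually does, compare the raw bytes it produces against plain `utf-8`:

```python
text = 'id,name\n1,café\n'
with_bom = text.encode('utf-8-sig')   # prepends EF BB BF for Excel
plain = text.encode('utf-8')          # no BOM, for code and pipelines

print(with_bom[:3])                   # b'\xef\xbb\xbf'
print(with_bom[3:] == plain)          # True: only the three-byte prefix differs
```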
Remove BOM (for programmatic use):
- VS Code: Click encoding → "Save with Encoding" → "UTF-8"
- Python: Read with `utf-8-sig` (strips BOM automatically), write with `utf-8`
- Command line: `sed -i '1s/^\xEF\xBB\xBF//' file.csv` (GNU sed; macOS's BSD sed doesn't support \x escapes)
- PowerShell 7+: `-Encoding utf8NoBOM`
# Python: remove BOM during conversion
with open('input.csv', 'r', encoding='utf-8-sig') as f:
    content = f.read()
with open('output.csv', 'w', encoding='utf-8') as f:
    f.write(content)
Which Method Should You Use?
| Method | OS | Difficulty | Batch Support | Best for |
|---|---|---|---|---|
| Text editor (VS Code / Notepad++) | Any | Easy | No | Quick one-off fix, any encoding |
| Excel Import Wizard | Windows / Mac | Easy | No | Non-technical users, preview before commit |
| iconv (command line) | Mac / Linux | Medium | Yes | Batch conversion, scripting |
| PowerShell | Windows | Medium | Yes | Windows automation |
| Python (pandas) | Any | Medium | Yes | Developers, data pipelines |
| Online tool (browser-based, local processing) | Any | Easy | Limited | No install needed, encoding handled automatically |
For a single file you need to fix right now, a text editor is the fastest. For recurring CSV exports, set up a Python script or iconv command and forget about it.
3 Mistakes That Corrupt Your Data During Conversion
Encoding conversion is simple in concept but has a few traps that can silently destroy data.
Overwriting the Original File
This is the most common and most painful mistake. If a conversion goes wrong — wrong source encoding, unmappable characters, interrupted write — and you saved over the original, that data is gone.
Always save to a new file name: output_utf8.csv, not input.csv. Conversions into narrower encodings like Shift-JIS or GBK are especially risky: characters that don't exist in the target encoding get silently replaced with ? or dropped entirely. You won't notice until someone points out the missing data.
Trusting Auto-Detection Blindly
Tools like Python's chardet, VS Code's auto-detect, and Notepad++'s encoding guess are all heuristics. They analyze byte patterns and make a statistical prediction. They're usually right, but they fail in predictable ways:
- Short files (under 100 bytes): not enough data to guess reliably
- ASCII-heavy files with a few special characters: multiple encodings produce identical byte sequences for ASCII
- Mixed-encoding files: some rows in UTF-8, others in Windows-1252 (this happens more often than you'd think with merged datasets)
Always open the converted file and scan for garbled characters before deleting the original. Spot-check rows that contain accented characters, CJK text, or currency symbols.
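One cheap safety net is a strict re-read of the converted file. This hypothetical check_utf8 helper fails loudly on the first invalid byte instead of passing garbage along:

```python
def check_utf8(path):
    # a strict decode raises on the first byte sequence that isn't valid UTF-8
    try:
        with open(path, encoding='utf-8', errors='strict') as f:
            f.read()
        return True
    except UnicodeDecodeError as err:
        print(f'not valid UTF-8: {err}')
        return False
```

Note that this only catches malformed bytes; mojibake that happens to be valid UTF-8 (like Ã©) still needs an eyeball check.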
Mixing Up "Reopen with Encoding" and "Save with Encoding"
In VS Code and similar editors, these are two different operations:
- "Reopen with Encoding": Re-reads the same bytes from disk using a different decoder. Non-destructive. Use this to figure out the correct encoding.
- "Save with Encoding": Takes the currently displayed text and writes it to disk in a new encoding. Destructive.
If the file is already showing garbled text and you hit "Save with Encoding → UTF-8", you're encoding the garbled characters as UTF-8. The corruption is now permanent. The correct sequence is: Reopen with the right encoding → verify text is correct → then save with the target encoding.
Stop Converting Files — Fix the Source Instead
If you're converting CSVs every week, the real problem isn't the file — it's the system that produced it. Here's how to eliminate encoding issues at the root.
Standardize on UTF-8 Across Your Pipeline
Most encoding problems disappear when everyone agrees on UTF-8.
- If you export CSVs from Excel: Use "Save As" → "CSV UTF-8 (Comma delimited)" instead of plain "CSV". This saves as UTF-8 with BOM, which other Excel users can open without issues.
- If you write code that produces CSVs: Always specify `encoding='utf-8'` explicitly. Don't rely on system defaults — they vary by OS and locale.
- If you receive CSVs from partners: Ask them to export in UTF-8. Most modern systems support it. For legacy systems stuck on Shift-JIS or Windows-1252, set up an automated conversion script as a preprocessing step.
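In Python, making the encoding explicit looks like this (the csv module also wants `newline=''` so it can control line endings itself; the output filename is just an example):

```python
import csv

# write with an explicit encoding; never rely on the locale default
with open('export.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'name'])
    writer.writerow([1, 'café'])
```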
Use Tools That Handle Encoding Automatically
The easiest way to deal with encoding is to not deal with it at all. Some browser-based CSV tools auto-detect encoding on upload — you drag in a Shift-JIS file, a Windows-1252 file, or a UTF-8 file, and it just works.
LeapRows (disclosure: built by the author), for example, processes files entirely in the browser without uploading them to a server. It automatically detects and handles CSV encoding regardless of whether the file is UTF-8, Shift-JIS, or another format — no manual conversion, no encoding menus, no guessing. It also handles analysis tasks like filtering, pivoting, and aggregation directly in the browser, so you can skip the Excel round-trip entirely.
For teams that regularly process CSVs from multiple sources with different encodings, a tool like this removes an entire class of errors from the workflow. The key distinction to look for is local (in-browser) processing — tools that upload your file to a server still require you to think about what data you're sharing, even if they handle encoding automatically.
Wrapping Up
CSV encoding problems boil down to one thing: the file was saved in encoding A, but your tool is reading it as encoding B. Once you identify the mismatch (use the pattern cheat sheet above), the fix takes under a minute with any of the five methods covered here.
The quick version:
- Check the garbled pattern to identify the source encoding
- Convert using whichever tool fits — text editor for one-offs, Python or iconv for batches
- Watch out for BOM: add it if Excel users need the file, remove it if code will process it; and be aware that PowerShell 5.x's `-Encoding UTF8` adds BOM by default
- Never overwrite the original until you've verified the conversion
Long-term, push for UTF-8 everywhere. The fewer encoding decisions humans have to make, the fewer encoding problems you'll have.