Follow this Blog!

Sep 17, 2018

f Comment

Convert Big5 to UTF-8 and Insert into UTF-8 MySQL Table! 如何把 Big5 轉換成 UTF-8 並存在 MySQL 資料庫內?

Amazon Suppose you want to do the following:

1. Crawl the Chinese characters from a Big5 encoded webpage which most likely has this HTML markup:

<meta http-equiv="Content-Type" content="text/html; charset=big5">

2. Convert them to UTF-8 characters and insert them into a MySQL database table with UTF-8 encoding on a remote machine.

3. Use Putty to log in to the remote machine and view the non-garbage records correctly in the MySQL database table.

And you want to read the Chinese characters as UTF8 encoded characters. You certainly don't want to see gibberish like this:

mysql> select * from stock_test;
+------+---------------+
| id | name |
+------+---------------+
| 1101 | �泥 |
| 1102 | 亞泥 |
| 1103 | 嘉泥 |
| 1104 | ç’°æ³¥ |
| 1108 | 幸� |
| 1109 | 信大 |
| 1110 | �泥 |
+------+---------------+
7 rows in set (0.00 sec)

Instead, you want to see correctly encoded Chinese characters like this:

mysql> select * from stock_test;
+------+--------+
| id | name |
+------+--------+
| 1101 | 台泥 |
| 1102 | 亞泥 |
| 1103 | 嘉泥 |
| 1104 | 環泥 |
| 1108 | 幸福 |
| 1109 | 信大 |
| 1110 | 東泥 |
+------+--------+
7 rows in set (0.00 sec)

Exactly what should you do in order to see correct UTF-8 Chinese text instead of garbage characters or junk text when you query MySQL?

本文章會教你如何把 Big5 文字轉換成 UTF-8 文字,並正確的儲存到 UTF-8 MySQL 資料庫,你又可以用 Putty 讀到正確的中文字資料,而不是亂碼。

Create MySQL table

Here's how you would create a MySQL database table with UTF-8 encoding:

create table stock_test (
  id varchar(16) not null,
  name varchar(16) not null,
PRIMARY KEY (id)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

The following way is less recommended but it also works:

create table stock_test (
  id varchar(16) not null,
  name varchar(16) not null,
PRIMARY KEY (id)
) DEFAULT CHARSET=utf8;

Create crawl script

I use PHP and Eclipse IDE to develop my crawl script to crawl data from a Big5 encoded webpage. I have to make sure the following is true:

- the script's file encoding is set to UTF-8 (right click on the file in Eclipse and select Properties)

- I execute the MySQL command "SET NAMES UTF8" after I connect to the database. The code looks like this:

$conn = new mysqli($servername, $username, $password, $dbname);

$conn->query("SET NAMES UTF8");

Once you crawl the Big5 encoded characters from the webpage, call some PHP function to convert them to UTF-8 characters. Here's the function I use:

$converted_text = iconv($sourceCharset, $targetCharset, $text);

In our case, the function call would look like this:

$converted_text = iconv('big5', 'UTF-8', $text);

Insert $converted_text into your MySQL database table.

Use Putty to log in to remote server

In Window -> Translation, make sure the "Remote character set" in the "Character set translation" section is set to UTF-8.

Connect to the remote server with Putty. My remote box is running Ubuntu 14.04.5 LTS.

Log in to MySQL server

Run this command to use the MySQL command-line tool:

mysql --default-character-set=utf8 -u user_name -ppassword stock_test

Depending on your MySQL settings, you may not need the option --default-character-set=utf8 but it doesn't hurt.

Run the following usual MySQL command to select all data from table stock_test:

select * from stock_test;

And you should see the correctly encoded Chinese characters with no gibberish.

Questions? Let me know!
Please leave a comment here!
One Minute Information - by Michael Wen
Find Michael on Google or Facebook
ADVERTISING WITH US - Direct your advertising requests to Michael