HKUST Library Pinyin Conversion Project
In Phase II of the Conversion Project, we need to convert the Wade-Giles romanized personal name headings in our bibliographic records into Pinyin, in accordance with the Library of Congress (LC) authority file. To save the time and effort needed to identify the names that should not be converted, we decided to use LC's Exclusion List and the converted authority records from OCLC as our basis for conversion. We started work on Phase II and III conversion in September 2000, as we anticipated receiving the authority records from OCLC around that time.
The goals of Phase II Conversion are two-fold:
Both Chinese and Non-Chinese Records that contain Wade-Giles romanized names will be dealt with in this Phase. Excluded from this conversion are:
A major task in this phase is compiling a Master Name List (MNL) database, a tool for deciding whether and how to convert each personal name, because not all names are to be changed automatically to Pinyin. Some names retain their Wade-Giles or original forms as their established headings, such as names of persons in Hong Kong (e.g. Chan, Tak Po) or Singapore (e.g. Lee, Kuan Yew). The conversion consists of two main steps: building the MNL and converting name headings based on it.
The MNL is a database of name pairs: each personal name heading (in Wade-Giles or original form) and its corresponding valid form (either converted into Pinyin or retained in its original form). When the name headings of bibliographic records are processed against the MNL, the computer program checks each Wade-Giles name heading against the MNL entries. If a matching entry is found, the heading is changed to the valid form recorded in the MNL.
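The matching step can be pictured as a simple lookup. The following is only a minimal sketch, assuming the MNL is held in memory as a Python dictionary; the names `MNL` and `resolve_heading` are hypothetical and do not come from the project's own programs, and the two sample pairs are taken from examples given elsewhere in this report.

```python
# Hypothetical in-memory stand-in for the MNL: original heading -> valid form.
MNL = {
    # A name on the LC Exclusion List keeps its original form.
    "Chiang, Ching-kuo": "Chiang, Ching-kuo",
    # A name that is converted to Pinyin.
    "Tien, Hao": "Tian, Hao",
}


def resolve_heading(heading, mnl):
    """Return the valid form for a heading, or None if the MNL has no match."""
    return mnl.get(heading.strip())


print(resolve_heading("Tien, Hao", MNL))
print(resolve_heading("Chiang, Ching-kuo", MNL))
print(resolve_heading("Unknown, Name", MNL))
```

A heading with no match returns `None`, so the caller can leave it unchanged or set it aside for manual handling.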
Sources of names for the MNL:
LC Exclusion List
The LC Exclusion List, which we obtained from the Pinyin Project Group of the Library of Congress in June 2000, contains personal names that, according to LC conversion guidelines, should not be converted into Pinyin. For example, the heading for Chiang Ching-kuo is commonly romanized in Wade-Giles form and retains that original form. There are 2,459 names on the List. In the MNL, the valid forms of these names are the same as the original forms.
OCLC Authority Records
Another source of names for the MNL is the personal name authority records that come from OCLC through our subscribed authority control service. A computer program extracts the personal name records that have the LC Pinyin marker 'c' (fully or partially converted) or 'n' (considered for conversion but not converted) in the 008/07 field. The Wade-Giles names in tag 400 and the established names in tag 100 are extracted from these authority records and added to the MNL as name pairs.
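The extraction described above can be sketched as follows. This is an illustration only, not the project's program: each authority record is assumed to be a plain dictionary with the 008 field as a string, the tag 100 heading as a string, and the tag 400 headings as a list. The function name `extract_name_pairs` and the placeholder 008 values are hypothetical. "008/07" is read as character position 7 counted from zero, following the usual MARC convention.

```python
def extract_name_pairs(authority_records):
    """Collect (tag 400 form, tag 100 form) pairs from marked authority records."""
    pairs = []
    for rec in authority_records:
        marker = rec["008"][7:8]  # 008/07: positions are counted from 0
        if marker not in ("c", "n"):
            continue  # record carries no LC Pinyin marker of interest
        for variant in rec.get("400", []):
            pairs.append((variant, rec["100"]))
    return pairs


# Placeholder 008 strings: only position 7 matters for this sketch.
sample_records = [
    {"008": "-------c", "100": "Tillman, Hoyt Cleveland", "400": ["Tien, Hao"]},
    {"008": "-------a", "100": "Smith, John", "400": ["Smith, J."]},
]

pairs = extract_name_pairs(sample_records)
print(pairs)
```

Only the first sample record is kept, because the second carries neither marker.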
Local Bibliographic Records
The third source of data for the MNL is the name headings from the local bibliographic records in our library catalog. These name headings do not appear in the LC Exclusion List and do not have linked authority records. They will be extracted from bibliographic records, converted into Pinyin by a locally-developed computer program (nExtractBib), and added to the MNL in name pairs.
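The internals of the local conversion program are not described here, so the following is only a toy sketch of syllable-by-syllable conversion under stated assumptions. The table `SYLLABLES` is hypothetical and limited to syllables that occur in examples given in this report; a real table would also need to cover apostrophes, umlauts, and the full Wade-Giles syllabary. The formatting rule (given-name syllables joined into one capitalized word) is inferred from the report's examples.

```python
# Hypothetical Wade-Giles -> Pinyin table, restricted to syllables that
# appear in this report's examples.
SYLLABLES = {
    "i": "yi", "tien": "tian", "hao": "hao", "chan": "zhan",
    "kai": "gai", "ming": "ming", "lo": "luo", "yin": "yin",
}


def to_pinyin(heading):
    """Convert 'Surname, Given-name' syllable by syllable (toy version)."""
    surname, _, given = heading.partition(",")

    def convert(part):
        syllables = part.strip().replace("-", " ").split()
        # Raises KeyError for any syllable missing from the toy table.
        return "".join(SYLLABLES[s.lower()] for s in syllables).capitalize()

    converted_surname = convert(surname)
    converted_given = convert(given)
    if converted_given:
        return f"{converted_surname}, {converted_given}"
    return converted_surname


print(to_pinyin("Tien, Hao"))
print(to_pinyin("Chan, Kai-ming"))
```

The second call shows why purely mechanical conversion needs safeguards: a Hong Kong name that merely looks like Wade-Giles is converted as if it were.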
After completing the MNL, the actual conversion of name headings in bibliographic records can begin. The bibliographic records will be output and the Wade-Giles names in the relevant fields matched against names on the finalized version of the MNL. Matching headings will be converted to the valid form accordingly and the local Pinyin marker 'hkust' will then be added to tag 910 |c in the record.
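A rough sketch of this conversion pass is given below. It assumes a simplified record model (a list of `(tag, subfields)` tuples) rather than real ISO 2709 parsing, and it assumes that matching is done on subfield |a alone; both are assumptions of this sketch, as are the names `convert_record`, `NAME_TAGS`, and `LOCAL_MARKER`. Records that already carry the local marker are skipped, in line with the report's stated aim of avoiding double conversion.

```python
NAME_TAGS = {"100", "400", "600", "700", "800"}  # name tags cited in this report
LOCAL_MARKER = "hkust"


def convert_record(record, mnl):
    """Convert matching name headings in place; return True if any matched.

    record: list of (tag, subfields) tuples, where subfields is a dict.
    mnl: dict mapping original heading -> valid form.
    """
    # Skip records that already carry the local Pinyin marker in 910 |c.
    if any(tag == "910" and sub.get("c") == LOCAL_MARKER for tag, sub in record):
        return False
    matched = False
    for tag, sub in record:
        if tag in NAME_TAGS and sub.get("a") in mnl:
            sub["a"] = mnl[sub["a"]]
            matched = True
    if matched:
        record.append(("910", {"c": LOCAL_MARKER}))
    return matched


record = [("100", {"a": "Tien, Hao"}), ("245", {"a": "A sample title"})]
first_pass = convert_record(record, {"Tien, Hao": "Tian, Hao"})
second_pass = convert_record(record, {"Tian, Hao": "Something else"})
print(first_pass, second_pass, record)
```

The second call returns `False` and leaves the record untouched, because the marker added by the first call is already present.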
After the completion of Phase I conversion in July 2000, we started to plan for the next two phases of the Project. While waiting for OCLC to supply us with the converted authority records, we worked on the computer programs that would be needed for the next conversion phases. A series of computer programs has been developed in-house; the major ones include:
To identify eligible records for conversion, some rules based on the common patterns of Wade-Giles names have been formulated and built into the first two extraction programs described above. These rules help us to determine whether a name is romanized in Wade-Giles or not before the extraction. (Please refer to the Rules for Determining Wade-Giles Name Headings for details.)
Bibliographic and authority records in the input file for processing through these programs must be in ISO 2709 format. These programs have now been made available for public use. (Please refer to the Program Documentation for details.)
When the computer programs were finished, we ran our first test on 4,000 bibliographic records to see if the extracting and converting processes would work as expected. They were matched against the MNL, which, at that time, consisted of only the LC Exclusion List and the OCLC converted authority records.
As non-Wade-Giles personal name headings can be filtered by a set of well-defined rules in the programs, ambiguity is not as big a problem in Phase II as in Phase I. For example, "I" in personal names can be converted to "Yi" almost without doubt. Therefore, manual review of converted names based on ambiguity of Wade-Giles strings was deemed unnecessary in Phase II.
Another larger-scale test was run on about 87,000 bibliographic records (Chinese records with 'chi' in the language code). Careful review was carried out on the test results to detect possible incorrect conversions. A number of problems were found during the review and solutions were then sought to optimize the accuracy of conversion.
Romanization in tag 880 that represents unavailable Chinese characters in the system has not been converted, e.g.
880 1 |700-04/$1|a石[Tun],|d1128-1182
Solution: Since these strings should be converted and there are only a few of them, they will be converted manually after the rest of the conversion project is completed. A review file will be created to extract these records for conversion.
Some Hong Kong and Singaporean names are romanized according to the local dialect and happen to be in Wade-Giles form. These names will be incorrectly converted, e.g. Chan, Kai-ming (陳啟明) will be mis-converted into Zhan, Gaiming; Lo, Kai-yin (羅啟妍) will be mis-converted into Luo, Gaiyin.
Solution: To identify these potentially mis-converted headings, a list was created to extract bibliographic records that contain Wade-Giles strings and whose country of publication is neither China (country = cc) nor Taiwan (country = ch). Because weighing the pros and cons of the different approaches to this problem is complex, we decided to adopt a simple way of tackling these 3,000 headings. The records were separated into two groups:
Another conversion error occurs when a bibliographic heading happens to have the same romanized form as tag 400 of the authority record for another person, e.g.
Bibliographic record with heading 天昊:
100 1 |6880-01 |aTien, Hao
880 1 |6100-01/$1 |a天昊
Authority record of 田浩:
100 1 |aTillman, Hoyt Cleveland
400 1 |aTien, Hao
When the romanized form (Tien Hao) of "天昊" from a bibliographic record was matched against the MNL, it was incorrectly converted to "Tillman, Hoyt Cleveland" instead of "Tian, Hao" because it had the same romanized form as the 400 of the authority record of "田浩". About 180 name headings were found to be susceptible to this problem.
Solution: These names will be reviewed and manually converted. A computer program is used to find these headings by matching the converted bibliographic headings against tag 400 in the authority records. If they are in the same form, the records will be output for review.
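One possible reading of this check is sketched below: any bibliographic heading whose romanized form coincides with a tag 400 form in the authority records is output for review, together with the established heading it would be mapped to. The function name `find_review_candidates` and the data layout are hypothetical; the project's actual program may compare the records differently.

```python
def find_review_candidates(bib_headings, authority_pairs):
    """List headings that coincide with a tag 400 form in an authority record.

    authority_pairs: iterable of (tag 400 form, tag 100 form) tuples.
    Returns (heading, [established forms]) tuples for human review.
    """
    see_from = {}
    for variant, established in authority_pairs:
        see_from.setdefault(variant, set()).add(established)
    return [(h, sorted(see_from[h])) for h in bib_headings if h in see_from]


candidates = find_review_candidates(
    ["Tien, Hao", "Lee, Kuan Yew"],
    [("Tien, Hao", "Tillman, Hoyt Cleveland")],
)
print(candidates)
```

The output is only a candidate list; a cataloguer still has to decide whether the bibliographic heading and the authority record refer to the same person.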
To enable the conversion to proceed without delay, the original forms of these names were inserted into the MNL to exclude them from automatic conversion. The review and manual conversion will be done after the machine conversion.
The final version of the MNL was ready by December 11, 2000 with names from the following sources loaded. (For steps in the creation of the MNL and the data structure of the database, please refer to the Program Documentation.)
When the MNL was completed on December 11, 2000, we were ready to begin the conversion of bibliographic records. The eligible entries from Chinese and non-Chinese records were retrieved separately from the INNOPAC for processing.
Since October 1, 2000 (Day 1), we have been creating records with Pinyin romanization according to LC guidelines and adding a local Pinyin marker in tag 910 |c (in addition to the LC Pinyin marker in tag 987). These records were therefore exempted from the conversion process to avoid double conversion.
On December 11, 2000, we started the process by converting our non-Chinese records first. From our 333,919 non-Chinese bibliographic records, 3,490 records were identified by the computer program as containing Wade-Giles names in tag 100, 400, 600, 700 or 800. They were converted based on the final version of the MNL. We did the non-Chinese records first because this set involved fewer records. No moratorium on the editing of records was needed because the conversion was completed within an hour.
The next day we began working on our Chinese records: 87,537 records were output and processed through the computer program. The entire conversion and re-loading of records to the INNOPAC was finished on December 15, 2000. During the conversion period, editing of all Chinese records on the INNOPAC was suspended.
We made every attempt to shorten the time lag between the compilation of the MNL and the actual conversion of the bibliographic records, and consequently only 38 name headings were not found in the MNL. The bibliographic records with these headings were converted manually afterwards.
Phase II conversion was completed on December 15, 2000. The personal name headings of 91,027 records (87,537 Chinese records and 3,490 non-Chinese records) have been processed and all eligible personal name headings converted except for the following records that require manual review and conversion.