WorldCat Discovery release notes, Thai language search and sort
Release Date: August 2024
Introduction
These release notes describe Thai language searching and sorting support in WorldCat Discovery, completed in August 2024.
WorldCat Discovery now includes the following enhancements so that searching and sorting in Standard Thai behave as a native speaker expects:
Search
- We tokenize phrases to identify individual Thai words when building and parsing queries for word indexes.
- We maintain a list of common Thai words that we combine with adjacent words when we build word indexes and parse word index queries.
Sort
- We apply Unicode collation to sort Thai script along with characters from all other writing systems.
These Thai language searching and sorting improvements complement WorldCat Discovery’s Thai language user interface.
Standard Thai search and sort features
Search
Your library’s users searching for words or phrases in Standard Thai now get search results that meet the expectations of native Thai speakers. This is achieved through:
Normalization
We do not apply normalization to any Thai indexes. We always treat all Thai vowel symbols and diacritics as significant and never ignore them.
Exception: Characters with tone marks: composed/decomposed
We index Thai characters that have tone marks in both composed and decomposed forms.
Example:
A character entered in composed form in a search term matches both the composed and decomposed forms in the index. The reverse also holds: a character entered in decomposed form matches both the decomposed and composed forms.
Tokenization
Thai script phrases are written without spaces between words. To recognize and index individual Thai words, we apply tokenization when building all Thai word indexes and when parsing word index queries. We do not apply tokenization to phrase indexes or phrase index queries.
Indexing of individual Thai words enables word index searching whereby records containing the query terms anywhere in the appropriate indexed fields are retrieved.
Example
| | Thai | English translation |
|---|---|---|
| Query | ti:สนทนาภาษาจีน | Chinese conversation |
| Tokenized query | สนทนา · ภาษา · จีน | talk/converse · language · China |
| Matching record title | สนทนา 3 ภาษา ไทย-อังกฤษ-จีน โต้ตอบอย่างมั่นใจ พิชิตงานบริการในโรงแรม | Conversation in three languages: Thai-English-Chinese. Respond confidently and conquer service jobs in hotels. |
| Tokenized record title | สนทนา 3 ภาษา ไทย อังกฤษ จีน โต้ตอบ อย่าง มั่นใจ พิชิต งาน บริการ บริการ_ใน ใน ใน_โรงแรม โรงแรม | Talk/Converse 3 language Thai England China respond at/manner confident conquer work service service_in in in_hotel hotel |
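The tokenization step can be illustrated with a minimal sketch. It assumes a greedy longest-match strategy over a tiny hypothetical dictionary; these notes do not say which tokenizer or lexicon WorldCat Discovery uses, so the names `DICTIONARY`, `tokenize`, and `parse_query` are illustrative only.

```python
# Hypothetical mini-dictionary; a production tokenizer would use a full Thai lexicon.
DICTIONARY = {"สนทนา", "ภาษา", "จีน", "ไทย"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def tokenize(text: str) -> list[str]:
    """Split unspaced Thai text into words by greedy longest dictionary match."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink until a dictionary word fits.
        for end in range(min(len(text), pos + MAX_WORD_LEN), pos, -1):
            if text[pos:end] in DICTIONARY:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            tokens.append(text[pos])
            pos += 1
    return tokens


def parse_query(query: str) -> tuple[str, list[str]]:
    """Split an 'index:terms' query into its index label and tokenized terms."""
    index, _, terms = query.partition(":")
    return index, tokenize(terms)


print(parse_query("ti:สนทนาภาษาจีน"))
```

For the sample query this sketch yields the index label `ti` and the three tokens สนทนา, ภาษา, and จีน, which agrees with the tokenized query shown in the example.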
Common words
We do not treat the common Standard Thai words listed below as stop words, which would mean ignoring them for indexing and matching. Instead, we combine them with adjacent words when we build word indexes and parse word index queries.
When a common Thai word is ignored (treated as a stop word), the adjacent word that remains can have a different meaning from the one it has in combination with the common word, and that different meaning can lead to the retrieval of irrelevant records. Combining the word with the adjacent common word disambiguates its meaning and improves search precision by reducing the retrieval of irrelevant records.
We apply the above processing of common words when building and searching the following indexes:
- se: Series
- ti: Title
- kw: Keyword
Example
Common word treated as a stop word
- Query ti:มาตราการ (measures/procedure)
- Tokenized into มาตรา and การ
- The common word การ is removed as a stop word leaving only มาตรา
- มาตรา has a different meaning (section or clause of law) from มาตราการ (measures/procedure) and therefore retrieves records that are not relevant to the query ti:มาตราการ
Common word combined with an adjacent word
- Query ti:มาตราการ (measures/procedure)
- Tokenized into มาตรา_การ (because การ is defined as a common word)
- Titles of records containing มาตราการ are tokenized into มาตรา มาตรา_การ การ
- Only records containing มาตราการ are retrieved.
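The difference between the two approaches can be sketched in code. The sketch below assumes already-tokenized input and uses only a few entries from the common word list. The function names and the exact rule for building query tokens are assumptions made for illustration; the notes show only the resulting tokens, not the implementation.

```python
# A few entries from the common word list in these notes; the full list is longer.
COMMON_WORDS = {"การ", "ใน", "ของ", "และ"}


def index_tokens(words: list[str]) -> list[str]:
    """Tokens stored for a record: every word, plus each pair involving a common word."""
    tokens = []
    for i, word in enumerate(words):
        tokens.append(word)
        if i + 1 < len(words):
            nxt = words[i + 1]
            if word in COMMON_WORDS or nxt in COMMON_WORDS:
                tokens.append(f"{word}_{nxt}")
    return tokens


def query_tokens(words: list[str]) -> list[str]:
    """Tokens searched for a query (assumed rule): pairs replace the words they cover."""
    tokens = []
    covered = set()
    for i in range(len(words) - 1):
        if words[i] in COMMON_WORDS or words[i + 1] in COMMON_WORDS:
            tokens.append(f"{words[i]}_{words[i + 1]}")
            covered.update((i, i + 1))
    # Words that are not part of any combined pair are searched on their own.
    tokens.extend(w for i, w in enumerate(words) if i not in covered)
    return tokens


def matches(query_words: list[str], record_words: list[str]) -> bool:
    """A record matches when every query token appears among its index tokens."""
    return set(query_tokens(query_words)) <= set(index_tokens(record_words))


print(index_tokens(["มาตรา", "การ"]))
print(matches(["มาตรา", "การ"], ["มาตรา", "การ"]))
print(matches(["มาตรา", "การ"], ["มาตรา"]))
```

Under these assumptions, a title containing only มาตรา is indexed without the combined token มาตรา_การ, so the query does not retrieve it, while a title containing มาตราการ does match.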
Thai common word list
กว่า กับ การ ก็ ขณะ ของ ความ คือ จะ จึง
ซึ่ง ด้วย ตั้งแต่ ต่างๆ ถึง ถ้า ทั้ง ทั้งนี้ ที่ นั้น
นี้ ว่า หรือ หาก อะไร อาจ อีก เช่น เนื่องจาก เป็นการ
เพื่อ เมื่อ เลย เอง แต่ และ แล้ว โดย ใน ไว้
Sort
We sort Standard Thai author, title, and call number fields using the default collation order of the Unicode Collation Algorithm, which we apply to all scripts and languages.
Alphabetical sorting is available in WorldCat Discovery when using the following features:
Sort search results:
- Author (A-Z)
- Title (A-Z)
The Author search filter expanded to show more:
- The Author search filter initially displays authors sorted by matching record count, highest first. Selecting the Show More option to expand the filter sorts the authors alphabetically.
- If the expanded and alphabetically sorted view includes author names in multiple scripts, names in Latin script are presented first followed by those in other scripts.
Browse the Shelf from the item details page:
- Browse the Shelf uses sorting of call numbers. Call number sorting commonly differentiates items with the same call number using an alphabetical suffix. Thus, QV772 ร451 would sort before QV772 ล148ย because ร sorts before ล.
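The sorting behavior can be approximated with a rough sketch. Python's standard library has no implementation of the Unicode Collation Algorithm, so the code below is not that algorithm. It is a simplified stand-in that assumes one Thai-specific rule (a leading vowel such as เ or แ is ordered after the character that follows it) and then compares code points. The function name `thai_sort_key` is hypothetical, and a real system would more likely rely on a collation library such as ICU.

```python
# Thai vowels that are written before the consonant they belong to (U+0E40..U+0E44).
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")


def thai_sort_key(text: str) -> str:
    """Simplified key: move each leading vowel after the character that follows it."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


# Call numbers from the Browse the Shelf example.
call_numbers = ["QV772 ล148ย", "QV772 ร451"]
print(sorted(call_numbers, key=thai_sort_key))

# Two words where plain code point order and the simplified key disagree.
words = ["\u0e02\u0e32", "\u0e41\u0e01"]  # ขา, แก
print(sorted(words, key=thai_sort_key))
```

With this key, QV772 ร451 sorts before QV772 ล148ย, as in the example above, and แก sorts before ขา because the consonant ก is compared before the vowel แ.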
Important links
Product website
More product information can be found here.
Support website(s)
Support information for this product and related products can be found at:
- WorldCat Discovery support resources
- WorldCat Discovery training
- Release notes
- OCLC customer support
- Browser compatibility chart
If you have additional questions, please contact OCLC Customer Service by calling 1-800-848-5800 or 1-614-793-8682 Monday – Friday 8 a.m. – 7 p.m. ET, or email support@oclc.org. For support enquiries in the UK and Ireland, please contact the Support Desk by calling +44-(0)114-281 60 42 or emailing support-uk@oclc.org. Support is available between the hours of 09:00 and 17:30 (UK Time).
Include Request ID with problem reports
When reporting an issue with WorldCat Discovery, it is extremely helpful to include the Request ID. The Request ID is found at the bottom of the screen on which the issue occurred. Including this information allows us to directly trace what happened on the request we are troubleshooting.