ChatGPT-4 Can Help Hand Surgeons Communicate Better With Patients

The American Society for Surgery of the Hand and British Society for Surgery of the Hand produce patient-focused information above the sixth-grade readability level recommended by the American Medical Association. To promote health equity, patient-focused content should be aimed at an appropriate level of health literacy. Artificial intelligence–driven large language models may be able to assist hand surgery societies in improving the readability of the information provided to patients. The readability of every English-language article on the American Society for Surgery of the Hand and British Society for Surgery of the Hand websites was calculated using seven of the most commonly used readability formulas. Chat Generative Pre-Trained Transformer version 4 (ChatGPT-4) was then asked to rewrite each article at a sixth-grade readability level. The readability of each response was calculated and compared with that of the unedited articles. Chat Generative Pre-Trained Transformer version 4 improved readability across all chosen formulas and achieved a mean sixth-grade readability level according to the Flesch Kincaid Grade Level and Simple Measure of Gobbledygook calculations. It also increased the mean Flesch Reading Ease score, with higher scores representing more readable material. This study demonstrates that ChatGPT-4 can be used to improve the readability of patient-focused material in hand surgery. However, ChatGPT-4 is optimized primarily to sound natural, not to seek truth; each response must therefore be evaluated by the surgeon to ensure that information accuracy is not sacrificed for readability by this powerful tool.

The American Society for Surgery of the Hand (ASSH) and British Society for Surgery of the Hand (BSSH) have both developed patient-focused information about common hand conditions and their management on their respective websites.1,2 To be effective, this content should be understandable to patients at all levels of health literacy. The majority of Americans read at around a sixth- to eighth-grade level.3 Because of this, the American Medical Association recommends that all patient-focused educational materials be written at or below a sixth-grade readability level.3 This helps ensure that patients are properly informed about their condition and treatment options and promotes health equity. The issue of patient-directed content in hand surgery being aimed at a readability level above that recommended by the American Medical Association has been highlighted before,4 with little improvement seen over the past decade.5 Chat Generative Pre-Trained Transformer version 4 (ChatGPT-4) is an artificial intelligence (AI)–driven large language model that has shown great promise in many disparate fields, including health care, since its release by OpenAI (OpenAI Inc). Although it is rapidly progressing, ChatGPT has yet to surpass residents' ability to pass hand surgery examinations6 and produces less readable7 and less reliable8 patient-focused content than is available on professional hand society websites. However, ChatGPT has shown promise in improving the readability of patient education materials in many surgical specialties, including orthopedic surgery, when prompted to do so.9 The aim of this study was to evaluate the ability of ChatGPT-4 to simplify the patient-focused hand surgery information provided on the ASSH and BSSH websites to a sixth-grade readability level.

Methods
All articles written in English from the websites of the ASSH 1 and BSSH 2 were downloaded and converted to plain text, with all figures and figure legends removed.

Readability calculation
A freely available online readability calculator, available at http://www.readabilityformulas.com, was used to calculate the readability of the included articles, before and after ChatGPT-4 alteration. This calculator determines the readability of a given piece of text based on seven commonly used readability formulas: the Automated Readability Index, Gunning Fog Score, Flesch Kincaid Grade Level (FKGL), Flesch Reading Ease (FRE), Coleman-Liau Index, Simple Measure of Gobbledygook (SMOG), and Linsear Write Formula. All of these formulas express readability in terms of grade level, with the exception of the FRE, which reports an index score. Higher index scores represent more readable material, with a score of 100 representing the most readable and 0 representing the least readable material.
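For illustration, two of the listed formulas (FKGL and FRE) can be computed directly from sentence, word, and syllable counts. The Python sketch below is hypothetical and was not part of the study's methods; its vowel-group syllable counter is a crude heuristic, so its scores will not exactly match those of the online calculator used here.

```python
import re

def count_syllables(word: str) -> int:
    """Crude vowel-group heuristic; production calculators use dictionaries."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1  # treat a final 'e' as silent
    return max(n, 1)

def readability(text: str) -> dict:
    """Return Flesch Kincaid Grade Level and Flesch Reading Ease for text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word
    return {
        # FKGL: expressed as a US school grade; higher = harder to read
        "FKGL": 0.39 * wps + 11.8 * spw - 15.59,
        # FRE: index score 0-100; higher = easier to read
        "FRE": 206.835 - 1.015 * wps - 84.6 * spw,
    }
```

A text of short, monosyllabic sentences scores well below grade six on the FKGL and near the top of the FRE scale, matching the direction of the scores reported in this study.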

ChatGPT-4 alteration
The plain text articles were copied into ChatGPT-4 (https://chat.openai.com/), preceded by the prompt "Rewrite the following text at a sixth-grade readability level." The first response to each prompt was copied, and readability was calculated as described above.

Statistical analysis
Data were collected and stored in Microsoft Excel (Microsoft Corp) using a predefined proforma to include all calculated readability grade levels, index scores, and word counts, both pre- and post-ChatGPT-4 alteration. Descriptive statistics were generated for grade level measures of central tendency. Paired Student t tests were used to compare the pre- and post-ChatGPT-4 grade levels and index scores. Significance level was set at 5%, and a two-tailed test of significance was used.
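The paired comparison above can be sketched in a few lines of Python. The data below are hypothetical, not the study's; in practice a statistics package such as SciPy's `ttest_rel` would be used, which also returns the two-tailed P value.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before: list[float], after: list[float]) -> float:
    """Paired Student t statistic on per-article grade-level differences."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    # t = mean difference divided by its standard error
    return mean(diffs) / (stdev(diffs) / sqrt(n))
```

Each article contributes one difference (unedited grade minus edited grade), so the test accounts for the pairing of pre- and post-alteration scores rather than treating the two sets as independent samples.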

Results
One hundred seventeen articles were included in the final analysis: 95 published by the ASSH and 22 published by the BSSH. Table 1 displays the unedited and ChatGPT-4–edited mean grade levels for each calculated readability formula. The mean decrease after ChatGPT-4 alteration was two grade levels. A readability level of grade nine was the mode in the unedited articles, compared with a mode of grade six in the ChatGPT-4–edited articles. The mean (±SD) FRE in the pre- and post-ChatGPT-4–edited text samples was 61.6 (±7.5) and 76.9 (±8.2), respectively (P < .001). The mean word counts were 728.3 for the unedited and 303.6 for the ChatGPT-4–edited articles (P < .001). Table 2 shows a typical example of an article and the corresponding ChatGPT-4 version, with the mean readability across the six formulas that report grade level displayed.

Discussion
This article highlights the potential of AI to promote equity in the care of hand surgery patients by adjusting patient-focused materials to a readability level appropriate to patients' health literacy. Chat Generative Pre-Trained Transformer version 4 succeeded in simplifying the included texts to a mean sixth-grade readability level as measured by the FKGL and SMOG, but not by the four other included readability tests that report grade level. There was, however, a statistically significant mean decrease of two grade levels per article.
Kirchner et al9 evaluated the ability of an earlier, free version of ChatGPT (9 January 2023, version 3.5) to simplify online patient educational materials about common orthopedic conditions and procedures. They prompted ChatGPT-3.5 to "translate to fifth-grade reading level (sic)" and found that the median FKGL for included articles about lumbar disc herniation, scoliosis, and spinal stenosis decreased from 9.5, 12.6, and 10.9 to 5.0, 5.6, and 6.9, respectively.9 They found similar results for total hip and knee arthroplasty.9 Kirchner et al also commented that ChatGPT-3.5 achieved improved readability without any factual inaccuracies while retaining sufficient information to allow patients to make treatment decisions. This assertion was reached by each author independently reviewing the AI-converted text and coming to a conclusion, without reference to a particular quality assessment instrument.
Rouhi et al10 used a similar method of "independent review by multiple investigators" to identify factual errors or inaccuracies in ChatGPT-3.5–adjusted patient education materials about aortic stenosis and could find none. They similarly found a baseline readability higher than the recommended level in online patient-focused materials on aortic stenosis and a trend toward improved readability after ChatGPT-3.5 alteration. They used the FRE, FKGL, Gunning Fog Index, and SMOG readability scores and found that the FRE and FKGL reached their sixth-grade readability threshold.10 This study, like those discussed above, is limited by the lack of an objective method of assessing the quality of material produced by ChatGPT-4. Some authors advocate tools such as DISCERN (http://www.discern.org.uk) to assess information quality in this context. In the authors' opinion, this instrument is limited because it was not validated for use with AI-generated text. Two of the 16 questions in DISCERN ask whether the sources are cited and whether the date of production is clear. These are clearly irrelevant in the case of a large language model such as ChatGPT-4, which has been trained on a vast quantity of source material, is constantly progressing in response to inputs, and produces answers on demand. Future work could focus on producing an instrument validated in this context.
Hand surgery is an innovative specialty, and hand surgeons are often keen to adopt new technologies to improve patient care and outcomes. However, ChatGPT-4 is optimized to provide natural-sounding responses to prompts and is not necessarily truth-seeking. Indeed, a footnote on the ChatGPT website reads: "ChatGPT can make mistakes. Consider checking important information." As AI technology continues to evolve rapidly, the barrier to entry will likely be lowered for individual organizations such as the ASSH and BSSH to create personalized chatbots that can interact with patients, providing accurate information at a readability level appropriate to their health literacy. Until then, this study has demonstrated that ChatGPT-4 can be a powerful tool for improving the readability of patient-focused educational materials, but it remains incumbent upon individual surgeons who use this technology to ensure the accuracy of the information provided.

Declaration of Generative AI and AI-assisted Technologies in the Writing Process
During the preparation of this work, the authors used ChatGPT-4 to adjust the article text as explicitly described in the manuscript above; it was not used in drafting or revising the remainder of the manuscript. After using this tool, the authors reviewed and analyzed the content as described in the manuscript and take full responsibility for the content of the publication.

Table 1
Mean Grade Level per Readability Formula*
* Expressed in terms of grade level.