Professional Editing of Seed Tasks May Make Open Source AI Smarter (part 1 of 3)
Fred Zimmerman
Professional Editing of Seed Tasks May Improve the Quality of Self-Generated Instructions[1]
Fred Zimmerman, Nimble Books LLC
Correspondence: [email protected]
Abstract:
Seed tasks in synthetic data sets can have disproportionate influence and deserve close scrutiny. However, it is evident from inspecting such data sets that many tasks never receive professional editing or careful inspection by a deep expert in the domain. An inspection of the 175 seed tasks in the widely adopted Alpaca synthetic data set found correctable inconsistencies or errors in 87 of the 175 tasks, or almost exactly half. Nineteen of these, or roughly 22%, were "only" clerical in nature, i.e., misspellings, capitalization, or formatting; the remainder had substantive flaws ranging from outright error, to lack of alignment with the prompt, to "merely" being boring. These findings imply that nearly half of the Alpaca model's 52,000 self-generated training items will be avoidably influenced by weak data. Any subsequent project drawing on this implementation of Alpaca would inherit the flaws.
Methods
A professional editor with extensive experience in both consumer and scientific publishing did a line-by-line substantive and copy edit of the 175 seed tasks in the Alpaca and Alpaca Libre repositories. The tasks were converted from JSONL format to CSV and then to a single large table in a Microsoft Word document with Track Changes on. The review took approximately 8 person-hours for roughly 15,000 words of tasks. No automated tools were used, though automation would certainly be worth exploring in future exercises. When the review was completed, the document, with all mark-up showing, was saved as a PDF; it appears in Appendix 1. To make it easier for other editors to replicate this project, a copy of the original, unedited Word document is available in the project's repository.
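For editors who would like to reproduce the JSONL-to-CSV conversion step, the short Python sketch below flattens a seed-task file into a four-column spreadsheet suitable for pasting into a Word table. It assumes the field layout of the Alpaca repository's seed_tasks.jsonl (an id, an instruction, and a list of instances with input and output); the file names are placeholders, so adjust them to your local copy.

# Minimal sketch: flatten a seed-task JSONL file into a CSV for editorial review.
# Assumes Alpaca-style fields "id", "instruction", and "instances"
# (a list of {"input", "output"} pairs); adjust the keys if your copy differs.
import csv
import json

with open("seed_tasks.jsonl", encoding="utf-8") as src, \
        open("seed_tasks_for_review.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    writer.writerow(["id", "instruction", "input", "output"])
    for line in src:
        task = json.loads(line)
        # Most seed tasks carry exactly one instance; take the first.
        instance = (task.get("instances") or [{}])[0]
        writer.writerow([
            task.get("id", ""),
            task.get("instruction", ""),
            instance.get("input", ""),
            instance.get("output", ""),
        ])

The resulting CSV can then be imported into a Word document with Track Changes turned on, as described above.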
Findings
It was quickly evident that 1) the tasks came from a variety of sources, 2) some of the tasks were written by non-native English speakers, and 3) the tasks were written with varying expectations for output quality. (No criticism of the Alpaca team is intended; they rightly prioritized getting new LLM capabilities to the general public.)
As the tasks were reviewed, issues were identified ad hoc and then grouped into categories. No automated tools were used and the vocabulary was uncontrolled. The categories fell into three major clusters:
· Editorial
· Prompt/input/output match
· Output quality
Editorial
A large proportion of generated content is ultimately consumed by readers who expect that what they read will conform to standards for spelling, capitalization, treatment of numerals, acronyms and abbreviations, formatting, and so on. There are different, and often competing, standards for different types of content, media platforms, languages, and nations: the Chicago Manual of Style for books in the US, the Associated Press and New York Times style guides for newspapers, the Bluebook for legal citations, APA style for the social sciences, and on and on. Not only that, but many organizations have their own specialized style guides. Thus it is no easy task for a generative AI to "zero-shot" its way, without prompting, to the style standard most suitable to the topic, publisher, and audience at hand. Fortunately, there is a common core that underlies all style guides: readers expect spelling, punctuation, and formatting to be consistent with their prior experience (external consistency) and consistent within the content itself (internal consistency). The editorial review of the seed tasks was therefore focused on finding issues that would be actionable under any style guide. Inconsistencies or inaccuracies within the seed tasks, if reflected in the "self-instruct" generated examples, might have a disproportionate impact on model output quality.
The first category identified was spelling. Four potential errors were found: "Brack" for "Barack" Obama in the input for seed task 2, the UK "lustre" for the US "luster" (86), "Confucious" for "Confucius" (163), and "seperated" for "separated" (169). It may be helpful to generate examples in which the model encounters common misspellings in the input and quietly corrects them in the output. The "tacit" approach in task 2 is one alternative; another is to create a small number of seed tasks that explicitly teach the model how to "catch and correct"; a third is to incorporate a spelling directive in the generation prompt. Which would work best is an empirical question.
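To make the second alternative concrete, here is a minimal sketch of what an explicit "catch and correct" seed task might look like, expressed in the same Alpaca-style JSONL shape assumed earlier. The task id, name, instruction wording, and example sentence are hypothetical illustrations, not items from the original 175 seed tasks.

# Hypothetical "catch and correct" seed task in Alpaca-style JSONL form.
# All text here is illustrative; it is not drawn from the original seed set.
import json

catch_and_correct_task = {
    "id": "seed_task_misspelling_demo",   # hypothetical id
    "name": "correct_common_misspellings",
    "instruction": "Rewrite the sentence, silently correcting any misspellings.",
    "instances": [
        {
            "input": "Confucious said that learning without thought is labor lost.",
            "output": "Confucius said that learning without thought is labor lost.",
        }
    ],
    "is_classification": False,
}

# Append the new task to the seed file as one JSON object per line.
with open("seed_tasks.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(catch_and_correct_task, ensure_ascii=False) + "\n")

A handful of such tasks would give the self-instruct step explicit models of correction behavior; whether they outperform the tacit approach or a spelling directive in the generation prompt remains the empirical question noted above.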
Similar considerations apply to capitalization. Different disciplines, platforms, and languages have different capitalization conventions, but most share common principles of restraint and consistency. The general trend in American English over the long term has been toward more sparing use of initial capitals. There is no compelling reason for the upper casing found in tasks 2 and 15: "Night : Day :: Right : Left" and "Instability : Turmoil :: Change : Revolution". Task 19 refers to the subjects of "Data Mining" and "Machine Learning", but this is no more correct than referring to "the noble art of Physics": the modern style is to lower-case. In task 68 there is an issue with the capitalization and punctuation of informal dialogue. The input includes "yeah I am looking for a toy for my son." and the output includes "Waitress: sure I can do recommendations. How old is he?" The editorially correct renditions would be "Yeah, I am looking for a toy for my son." and "Sure, I can do recommendations."
(parts 2 and 3 will cover issues related to input/output alignment and output quality)
[1] DRAFT prior to final review.