Professional Editing of Seed Tasks May Make Open Source AI Smarter (part 1 of 3)

Professional Editing of Seed Tasks May Improve the Quality of Self-Generated Instructions[1]

Fred Zimmerman, Nimble Books LLC

Correspondence: [email protected]

Abstract:

Seed tasks in synthetic data sets can have disproportionate influence and deserve close scrutiny. However, it is evident from inspecting such data sets that many tasks never receive professional editing or careful inspection by a deep expert in the domain. An inspection of the 175 seed tasks in the widely adopted Alpaca synthetic data set found correctable inconsistencies or errors in 87 of the 175 tasks, almost exactly half. Nineteen of these, or roughly 22%, were "only" clerical in nature, i.e., misspellings, capitalization, or formatting; the remainder had substantive flaws ranging from outright error to lack of alignment with the prompt to "merely" being boring. These findings imply that nearly half of the Alpaca model's 52,000 self-generated training items will be avoidably influenced by weak data. Any subsequent project drawing on this implementation of Alpaca would inherit the flaws.

Methods

A professional editor with extensive experience in both consumer and scientific publishing performed a line-by-line substantive and copy edit of the 175 seed tasks in the Alpaca and Alpaca Libre repositories. The tasks were converted from JSONL format to CSV, then to a single large table in a Microsoft Word document with Track Changes on. The review took approximately eight person-hours for roughly 15,000 words of tasks. No automated tools were used, though they would certainly be worth adopting in future exercises. When the review was completed, the document, with all mark-up showing, was saved as a PDF; it appears in Appendix 1. To make it easier for other editors to replicate this project, a copy of the original, unedited Word document is available in the project's repository.
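For editors who want to replicate the conversion step, a minimal sketch follows. It assumes the seed tasks are in the Alpaca repository's seed_tasks.jsonl structure (an id, an instruction, and a list of instances with input and output fields); the file names and column choices here are illustrative, not the exact script used for this review.

    import csv
    import json

    # Read the seed tasks from JSON Lines: one task per line.
    with open("seed_tasks.jsonl", "r", encoding="utf-8") as f:
        tasks = [json.loads(line) for line in f if line.strip()]

    # Flatten each task's first instance into one row for editorial review.
    with open("seed_tasks.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "instruction", "input", "output"])
        for task in tasks:
            instance = task["instances"][0]  # seed tasks carry one instance each
            writer.writerow([task["id"], task["instruction"],
                             instance["input"], instance["output"]])

The resulting CSV can be pasted into Word as a table, after which Track Changes captures the edit trail task by task.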

Findings

It was quickly evident that 1) the tasks came from a variety of sources, 2) some of the tasks were written by non-native English speakers, and 3) the tasks were written with varying expectations for output quality. (No criticism of the Alpaca team is intended; they rightly prioritized getting new LLM capabilities to the general public.)

As the tasks were reviewed, issues were identified ad hoc, then grouped into categories; the vocabulary was uncontrolled. The categories fell into three major clusters:

• Editorial

• Prompt/input/output match

• Output quality

Editorial

A large proportion of generated content is ultimately consumed by readers who expect that what they read will conform to standards for spelling, capitalization, treatment of numerals, acronyms and abbreviations, formatting, and so on. There are different, and often competing, standards for different types of content, media platforms, languages, and nations: the Chicago Manual of Style for books in the US, the Associated Press and New York Times style guides for newspapers, the Bluebook for legal citations, APA style for the social sciences, and on and on. Not only that, but many organizations have their own specialized style guides. Thus, for a generative AI to "zero shot" without prompting the style standard that is most suitable to the topic, publisher, and audience at hand is no easy task. Fortunately, there is a common core that underlies all style guides: readers expect external consistency, in which spelling, punctuation, and formatting match their experience, and internal consistency within the content. The editorial review of the seed tasks was focused on finding issues that would be actionable under any style guide. Inconsistencies or inaccuracies within the seed tasks, if reflected in the "self-instruct" generated examples, might have a disproportionate impact on model output quality.
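Internal consistency is one of the few properties that can be checked mechanically under any style guide. As a minimal sketch, the fragment below flags seed tasks that mix UK and US spellings of the same word; the variant list is a tiny illustrative sample, not an exhaustive dictionary, and the seed_tasks.jsonl structure is assumed as above.

    import json
    import re

    # Illustrative UK/US variant pairs; a real check would use a fuller list.
    VARIANTS = [("colour", "color"), ("lustre", "luster"), ("organise", "organize")]

    with open("seed_tasks.jsonl", "r", encoding="utf-8") as f:
        for line in f:
            task = json.loads(line)
            text = " ".join([task["instruction"],
                             task["instances"][0]["input"],
                             task["instances"][0]["output"]]).lower()
            for uk, us in VARIANTS:
                # Flag any task that uses both spellings of the same word.
                if re.search(rf"\b{uk}\b", text) and re.search(rf"\b{us}\b", text):
                    print(f"{task['id']}: mixes {uk!r} and {us!r}")

A check like this catches only within-task inconsistency; cross-task consistency (the whole seed set settling on one spelling) still requires human judgment about which variant to standardize on.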

The first category identified was spelling. Four potential errors were found: "Brack" for "Barack" Obama in the input for seed task 2, the UK "lustre" for the US "luster" (86), "Confucious" for "Confucius" (163), and "seperated" for "separated" (169). It may be helpful to generate examples where the model encounters common misspellings in the input and quietly corrects them in the output. The "tacit" approach in task 2 is one alternative; another is to create a small number of seed tasks that explicitly teach the model how to "catch & correct"; a third is to incorporate a spelling directive in the generation prompt. Which would work best is an empirical question.
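To make the second alternative concrete, a "catch & correct" seed task might look like the following. This is a hypothetical task written for illustration in the Alpaca seed-task structure; it is not one of the existing 175, and the id and name are invented.

    # A hypothetical "catch & correct" seed task: the input contains a common
    # misspelling and the output demonstrates the silent correction.
    catch_and_correct_task = {
        "id": "seed_task_extra_1",
        "name": "silent_spelling_correction",
        "instruction": "Rewrite the sentence, correcting any spelling errors.",
        "instances": [{
            "input": "The two rooms are seperated by a thin wall.",
            "output": "The two rooms are separated by a thin wall.",
        }],
        "is_classification": False,
    }

The third approach would instead leave the seed tasks alone and add a line such as "silently correct obvious misspellings" to the generation prompt; that wording, too, is only a suggestion.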

Similar considerations apply to capitalization. Different disciplines, platforms, and languages have different capitalization conventions. However, most share common principles of restraint and consistency. The general trend in American English over the long term has been toward more sparing use of initial capitals. There is no compelling reason for the uppercasing found in tasks 2 and 15: "Night : Day :: Right : Left" and "Instability : Turmoil :: Change : Revolution". Task 19 refers to the subjects of "Data Mining" and "Machine Learning", but this is no more correct than referring to "the noble art of Physics": the modern style is to lowercase. In task 68 there is an issue with the capitalization and punctuation of informal dialogue. The input includes "yeah I am looking for a toy for my son." and the output includes "Waitress: sure I can do recommendations. How old is he?" The editorially correct renditions would be "Yeah, I am looking for a toy for my son." followed by "Sure, I can do recommendations."


(Numbers in parentheses refer to the list of seed tasks.)


(Parts 2 and 3 will cover issues related to input/output alignment and output quality.)


[1] DRAFT prior to final review.
