Professional Editing of Seed Tasks May Make Open Source AI Smarter (part 1 of 3)
Fred Zimmerman
Professional Editing of Seed Tasks May Improve the Quality of Self-Generated Instructions[1]
Fred Zimmerman, Nimble Books LLC
Correspondence: [email protected]
Abstract:
Seed tasks in synthetic data sets can have disproportionate influence and deserve close scrutiny. However, it is evident from inspecting such data sets that many tasks never receive professional editing or careful inspection by a deep expert in the domain. An inspection of the 175 seed tasks in the widely adopted Alpaca synthetic data set found correctable inconsistencies or errors in 87 of the 175 tasks, or almost exactly half. Nineteen of these, or roughly 22%, were "only" clerical in nature, i.e., misspellings, capitalization, or formatting; the remainder had substantive flaws ranging from outright error, to lack of alignment with the prompt, to "merely" being boring. These findings imply that nearly half of the Alpaca model's 52,000 self-generated training items will be avoidably influenced by weak data. Any subsequent project drawing on this implementation of Alpaca would inherit the flaws.
Methods
A professional editor with extensive experience in both consumer and scientific publishing did a line-by-line substantive and copy edit of the 175 seed tasks in the Alpaca and Alpaca Libre repositories. The tasks were converted from JSONL format to CSV and then to a single large table in a Microsoft Word document with Track Changes on. The review took approximately 8 person-hours for roughly 15,000 words of tasks. No automated tools were used, though automation would certainly be worth exploring in future exercises. When the review was completed, the document, with all mark-up showing, was saved as a PDF; it appears in Appendix 1. To make it easier for other editors to replicate this project, a copy of the original, unedited Word document is available in the project's repository.
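For editors who would like to reproduce the JSONL-to-CSV conversion step, the short Python sketch below flattens a seed-task file into a four-column spreadsheet suitable for pasting into a Word table. It assumes the field layout of the Alpaca repository's seed_tasks.jsonl (an id, an instruction, and a list of instances with input and output); the file names are placeholders, so adjust them to your local copy.

# Minimal sketch: flatten a seed-task JSONL file into a CSV for editorial review.
# Assumes Alpaca-style fields "id", "instruction", and "instances"
# (a list of {"input", "output"} pairs); adjust the keys if your copy differs.
import csv
import json

with open("seed_tasks.jsonl", encoding="utf-8") as src, \
        open("seed_tasks_for_review.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    writer.writerow(["id", "instruction", "input", "output"])
    for line in src:
        task = json.loads(line)
        # Most seed tasks carry exactly one instance; take the first.
        instance = (task.get("instances") or [{}])[0]
        writer.writerow([
            task.get("id", ""),
            task.get("instruction", ""),
            instance.get("input", ""),
            instance.get("output", ""),
        ])

The resulting CSV can then be imported into a Word document with Track Changes turned on, as described above.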
Findings
It was quickly evident that 1) the tasks came from a variety of sources, 2) some of the tasks were written by non-native English speakers, and 3) the tasks were written with varying expectations for output quality. (No criticism of the Alpaca team is intended; they rightly prioritized getting new LLM capabilities to the general public.)
As the tasks were reviewed, issues were identified ad hoc and then grouped into categories. No automated tools were used and the vocabulary was uncontrolled. The categories fell into three major clusters:
· Editorial
· Prompt/input/output match
· Output quality
Editorial
A large proportion of generated content is ultimately consumed by readers who expect that what they read will conform to standards for spelling, capitalization, treatment of numerals, acronyms and abbreviations, formatting, and so on. There are different, and often competing, standards for different types of content, media platforms, languages, and nations: the Chicago Manual of Style for books in the US, the Associated Press and New York Times style guides for newspapers, the Bluebook for legal citations, APA style for the social sciences, and on and on. Not only that, but many organizations have their own specialized style guides. Thus it is no easy task for a generative AI to "zero-shot" its way, without prompting, to the style standard most suitable to the topic, publisher, and audience at hand. Fortunately, there is a common core that underlies all style guides: readers expect spelling, punctuation, and formatting to be consistent with their prior experience (external consistency) and consistent within the content itself (internal consistency). The editorial review of the seed tasks was therefore focused on finding issues that would be actionable under any style guide. Inconsistencies or inaccuracies within the seed tasks, if reflected in the "self-instruct" generated examples, might have a disproportionate impact on model output quality.
The first category identified was spelling. Four potential errors were found: "Brack" for "Barack" Obama in the input for seed task 2, the UK "lustre" for the US "luster" (86), "Confucious" for "Confucius" (163), and "seperated" for "separated" (169). It may be helpful to generate examples in which the model encounters common misspellings in the input and quietly corrects them in the output. The "tacit" approach in task 2 is one alternative; another is to create a small number of seed tasks that explicitly teach the model how to "catch and correct"; a third is to incorporate a spelling directive in the generation prompt. Which would work best is an empirical question.
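To make the second alternative concrete, here is a minimal sketch of what an explicit "catch and correct" seed task might look like, expressed in the same Alpaca-style JSONL shape assumed earlier. The task id, name, instruction wording, and example sentence are hypothetical illustrations, not items from the original 175 seed tasks.

# Hypothetical "catch and correct" seed task in Alpaca-style JSONL form.
# All text here is illustrative; it is not drawn from the original seed set.
import json

catch_and_correct_task = {
    "id": "seed_task_misspelling_demo",   # hypothetical id
    "name": "correct_common_misspellings",
    "instruction": "Rewrite the sentence, silently correcting any misspellings.",
    "instances": [
        {
            "input": "Confucious said that learning without thought is labor lost.",
            "output": "Confucius said that learning without thought is labor lost.",
        }
    ],
    "is_classification": False,
}

# Append the new task to the seed file as one JSON object per line.
with open("seed_tasks.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(catch_and_correct_task, ensure_ascii=False) + "\n")

A handful of such tasks would give the self-instruct step explicit models of correction behavior; whether they outperform the tacit approach or a spelling directive in the generation prompt remains the empirical question noted above.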
Similar considerations apply to capitalization. Different disciplines, platforms, and languages have different capitalization conventions, but most share common principles of restraint and consistency. The general trend in American English over the long term has been toward more sparing use of initial capitals. There is no compelling reason for the upper casing found in tasks 2 and 15: "Night : Day :: Right : Left" and "Instability : Turmoil :: Change : Revolution". Task 19 refers to the subjects of "Data Mining" and "Machine Learning", but this is no more correct than referring to "the noble art of Physics": the modern style is to lower-case. In task 68 there is an issue with the capitalization and punctuation of informal dialogue. The input includes "yeah I am looking for a toy for my son." and the output includes "Waitress: sure I can do recommendations. How old is he?" The editorially correct renditions would be "Yeah, I am looking for a toy for my son." and "Sure, I can do recommendations."
(parts 2 and 3 will cover issues related to input/output alignment and output quality)
[1] DRAFT prior to final review.