Programming With the Artificial Computer Chauffeur

Introduction

In this series of posts I discuss some ideas on how to use speech input in various types of programming environments. This first post describes my particular take on the subject and who it could help, and outlines the two basic categories, or metaphors, that I employ in later posts.

These ideas are aimed at creating systems that:

1)   let non-programmers create at least simple programs

2)   make professional programmers more productive

3)   enable motor impaired individuals to write software

4)   allow development of basic scripts in nontraditional environments like VR, the phone, or other non-workstation systems, such as face-to-face interaction with robots

Speaking is roughly four times faster than typing, and speech is also more intuitive. The former can potentially improve the productivity of the professional programmer, while the latter may help the non-programmer create programs. The fact that speech is hands-free will help those who have problems with their hands and will facilitate scripting when no keyboard is available.

However, speech is error-prone, ill suited to formal languages, and slower than pressing a button. To get the advantages of speech we need to use it properly. Natural language interaction with an intelligent agent offers a potentially better way of creating programs than merely dictating and triggering speech-activated buttons. But natural language comes with its own pitfalls. It is often ambiguous, and in some circumstances it is particularly difficult to resolve referents, that is, to determine what object a phrase is describing. So in these posts I propose adding a spatial channel and providing environments rich with visualizations that give the user things to point at. This offers the potential to significantly reduce the complexity of utterances. For a non-disabled user the spatial channel would just be the mouse, a touch screen, or whatever else they might have, such as a hand-tracking system in VR. For professional programmers with carpal tunnel syndrome, the hope is that a large reduction of manual input would yield significant benefits. For severely motor impaired programmers we may be able to use other mouse replacement technologies such as eye tracking. I assert that being able to use even a little spatial input will significantly improve the utility of this type of system, and that there is very little need for fully speech-driven programming.

The Computer Chauffeur

The computer chauffeur is a metaphor I like to use to describe a multimodal, multi-paradigm user interface. Imagine you want to get something done using a particular computer program. Rather than spending weeks trying to become competent with that particular tool, you just find someone to operate the computer for you. In some instances you may be able to clearly delegate your task to your computer chauffeur, but in other cases the task may be much harder to convey, possibly because it is not even fully formed in your mind. In that event it may make sense to sit with them and work on the task collaboratively.

We can learn several things from observing this state of affairs. First, the application’s existing interface leaves something to be desired, since you were unable to pick it up immediately. Second, it suggests that a speech-only, eyes-free interface will not suffice either; in fact it will probably be far worse in many cases. The end user needs to see the screen and quite often needs to supplement the speech commands to the computer chauffeur with spatial input like gestures and drawing, in what is called a multimodal user interface. Often even this is not enough: even with an intelligent human computer chauffeur, the end user will sometimes prefer to directly manipulate the graphical user interface themselves. We use the term multi-paradigm user interface to refer to this alternation between gesturing, drawing, and direct manipulation.

The computer chauffeur brings a number of benefits to the interaction. He or she:

1)   Translates natural language into the obscure actions needed to operate the GUI application. This removes or at least reduces the learning curve.

2)   Translates one command into a whole series of GUI manipulations thus significantly reducing the amount of planning the end user must do.

3)   Brings some knowledge of the domain to the task.

4)   Performs repetitive tasks. Here the computer chauffeur is acting to extend the application.

5)   Relieves the end user of having to understand the artificial higher-level structures that the program employs to simplify the GUI. Instead the end user can communicate their ideas using more human-friendly constructs.

I assert that many of these functions and benefits can be easily realized in an artificial computer chauffeur (ACC) to a degree that still retains much of the utility of working with a real human assistant. In some cases the synthetic assistant can outperform the human assistant: if it is tightly integrated with the tool it is controlling, it may be able to execute many repetitive actions instantly.

The metaphor of the computer chauffeur also supplies us with a convenient means of thinking about how such a system could perform, and of running low-fidelity studies. As an initial step in developing such a user interface, one could simply observe a human-mediated human-machine interaction. At a later point we can instruct the mediator to adhere to some constraints that reflect the limited abilities of a computer. In some cases we could replace them with a team capable of performing actions very rapidly. Finally, we can hide the mediator altogether and create a Wizard of Oz scenario.

Methods

At the fringes of the simplest tasks it may be hard to define exactly what the difference between a program and a command is. Leaving this question aside, we can say that there are two general classes of ways one can communicate a program to the ACC. We can specify:

1)   what should be done at various points or in response to some events. Essentially we are programming the ACC directly, much as we might instruct an assistant on how to take care of a large task. We will call this approach direct agent instruction (DAI).

2)   how to add to or modify some representation of a program that is displayed on the screen. We will call this approach agent-mediated program editing (AMPE).

Some examples of DAI might be (the second is sketched in code after this list):

1)   Play an alarm at 6pm.

2)   Turn on the living room lights when motion sensor 5 detects someone.

3)   Sell my stock in XYZ Corporation when its price goes below $50 a share or over $55.

4)   Put everything like this <point> <point> <point> into this box <point>.

5)   When you see a squirrel in the garden spray it with water until it goes away.

6)   When the player goes here <point>, open this door <point>.
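
To make the second example above a little more concrete, here is a minimal sketch, in Python, of the kind of event-condition-action rule an ACC might compile that utterance into. Everything here is illustrative: the HomeHub class, the sensor and device names, and the toy dispatcher are assumptions invented for the sketch, not a real home-automation API.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Rule:
        """An event-condition-action rule an ACC could build from an utterance."""
        trigger: str                       # event name, e.g. "motion_detected"
        condition: Callable[[dict], bool]  # predicate over the event payload
        action: Callable[[dict], None]     # what to do when the condition holds

    class HomeHub:
        """Hypothetical home-automation facade; the name and method are assumptions."""
        def turn_on(self, device: str) -> None:
            print(f"turning on {device}")

    hub = HomeHub()

    # "Turn on the living room lights when motion sensor 5 detects someone."
    rule = Rule(
        trigger="motion_detected",
        condition=lambda event: event.get("sensor_id") == 5,
        action=lambda event: hub.turn_on("living room lights"),
    )

    def dispatch(event_name: str, payload: dict, rules: list[Rule]) -> None:
        """Toy dispatcher: fire every rule whose trigger and condition match."""
        for r in rules:
            if r.trigger == event_name and r.condition(payload):
                r.action(payload)

    dispatch("motion_detected", {"sensor_id": 5}, [rule])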

AMPE commands look more like the following (the third is sketched in code after the list):

1)   Put a for loop here <point> to go through this array <point>.

2)   Rename this variable y <point> to x and vice versa.

3)   Create setters and getters for all the variables in the class named “Foo”.
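
As a rough illustration of the third command, the sketch below simply generates source text for getters and setters given a class name and its fields. It stands in for whatever code generator or refactoring engine an ACC would actually drive; the class name and field list are invented for the example.

    def generate_accessors(class_name: str, fields: list[str]) -> str:
        """Return source text for a class with a simple getter and setter per field."""
        lines = [f"class {class_name}:"]
        for f in fields:
            lines += [
                f"    def get_{f}(self):",
                f"        return self._{f}",
                "",
                f"    def set_{f}(self, value):",
                f"        self._{f} = value",
                "",
            ]
        return "\n".join(lines)

    # "Create setters and getters for all the variables in the class named 'Foo'"
    print(generate_accessors("Foo", ["width", "height"]))

In practice an AMPE agent would more likely edit the existing class in place, for example through the IDE's own refactoring machinery, rather than emit new text, but the shape of the operation is the same.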

What is in it for non-programmers

The DAI approach is very intuitive for non-programmers, though it is potentially not scalable and may be less reliable than using a formal language. But as we shall see in the next post, it could act as a bridge to helping the non-programmer learn at least a simple formal programming system and transition to a limited amount of AMPE interaction.

The right shared environment for the user and their virtual assistant can also go a long way towards increasing the power of the system as well as improving the quality of the resulting programs. Mostly, though, as with many other end user programming systems, the aim is to extend the end user's ability to get the computer, or in the case of a robot the machine itself, to do more complex things, saving labor and reducing the need to bring in and deal with a human developer.

Generally we would want to let the non-programmer:

  • Modify existing application behaviors a little bit, going beyond merely adjusting advanced settings.
  • Add simple features to existing applications.
  • Process small amounts of data, like reformatting tables in one-off scenarios (one such script is sketched after this list).
  • Set up simple automated processes.
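
To give a feel for the data-processing bullet, here is the kind of tiny one-off script an ACC might produce from a request like "swap the first two columns of this table and save a copy". The file names and the particular column manipulation are placeholders chosen for the sketch, not output from any real system.

    import csv

    # Hypothetical request: "Swap the first two columns of this table and save a copy."
    # The file names below are placeholders for whatever the user pointed at.
    with open("report.csv", newline="") as src:
        rows = list(csv.reader(src))

    # Swap columns 0 and 1, keep the remaining columns as they are.
    reordered = [[row[1], row[0], *row[2:]] for row in rows]

    with open("report_reordered.csv", "w", newline="") as dst:
        csv.writer(dst).writerows(reordered)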

What is in it for professional programmers

For the most part I see professional programmers using an ACC in conjunction with something that looks pretty much like a current-day IDE. They would use the DAI approach to more rapidly create test suites and set other bug traps such as conditional breakpoints. Multimodal interaction can also potentially improve interactions with both static and dynamic code analysis tools.
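
As a purely hypothetical sketch of the test-suite use case, suppose the ACC has already parsed a spoken request such as "add a test that sorting an empty list returns an empty list" into a small intent record. Turning that record into a pytest-style stub could then be as simple as the code below; the function under test, the intent fields, and the naming scheme are all assumptions made for the example.

    # Hypothetical parsed intent; a real ACC would fill this in from the utterance.
    intent = {
        "target": "sort_items",                   # function under test (assumed name)
        "case": "empty list returns empty list",  # plain-language description
        "call": "sort_items([])",                 # expression to evaluate
        "expected": "[]",                         # expected result
    }

    def emit_pytest_stub(intent: dict) -> str:
        """Turn a parsed 'add a test that ...' request into a pytest-style stub."""
        name = intent["case"].replace(" ", "_")
        return (
            f"def test_{intent['target']}_{name}():\n"
            f"    assert {intent['call']} == {intent['expected']}\n"
        )

    print(emit_pytest_stub(intent))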

Conversely, the professional programmer can use the intelligent agent interface to operate more complex code generators and refactoring tools via the AMPE approach. They may also find it useful for operating simpler code generators and refactoring tools that are currently not worth the UI overhead of a conventional GUI.

How it can assist disabled programmers

Generally, a speech interaction costs a little more than a GUI interaction, and formal languages are not designed for dictate-ability. If the user is forced by circumstance to use speech, merely wrapping existing GUIs and formal languages with a speech user interface might be seen as better than nothing. But if we can use speech input with an intelligent agent that can take more natural language commands and convert them into code, the user experience improves greatly. Such an agent can process speech commands that each accomplish as much as 3 to 30 GUI steps.

The addition of a spatial channel goes against most work on programming by speech, but I believe that even a little bit of such input can greatly improve the environment's usability. I also see this approach being used as a supplement to conventional keyboarding, reducing it to a level sustainable by an injured developer. As with the non-disabled professional developer, the disabled programmer could use the DAI approach, greatly reducing keying for those tasks, and could also employ AMPE, but to a greater extent. In particular, I see disabled developers getting the most advantage out of many relatively simple code generators in lieu of dictation.

How to apply it in nontraditional environments

In nontraditional environments where a good keyboard is not available, DAI approaches could again be very useful in extending the reach of the non-programmer and allowing the professional to do more things without having to go back to their workstation.

Next time

In the next post I will describe how a multimodally controlled agent could provide end users with a powerful and intuitive DAI means of creating somewhat complex and, in some cases, even performant code.
