Evaluation as Feedback Cycle
This post on Best Of A Great Lot is a part of a series on the subject of designing a new form of governance. Each piece aims to stand alone, but fits together on the Table of Contents.
Previous: A Sketch of Belocracy. Next: Generating Ideas.
We know the path by which optimizers find effectiveness: they follow feedback loops on a signal of quality. This is a core truth of capitalism, evolution, machine learning, our most successful people, and every other system that optimizes in response to feedback. Belocracy aims to join these ranks by having an evaluation step and then providing direct incentives for succeeding at that evaluation step.
As people and as a society, we are constantly evaluating: journalism, academic research, political arguments and individual conversations at least sometimes refer to evidence we have about how well our decisions have panned out. But these informal, decentralized evaluations serve as only a weak feedback mechanism compared to what formal, centralized, independent evaluations can provide. Within belocracy, policies are reviewed to determine whether they are meeting their goals, and programs -- whether regulatory or infrastructure -- are evaluated on how well they are succeeding in comparison to their competitors. Rewards and reputation are handed out for successful outcomes. Individuals pursuing their own benefit drive improvement in the system, and as long as the feedback offered is closely aligned with reality, a belocratic system will improve on the back of these rewards.
It's very tempting to delay talking about evaluation until later. Intuitively, the evaluation of a thing comes after the thing itself, so it might seem obvious to talk about how policies and programs are made before talking about how they're evaluated. But the feedback loop of belocracy is the heart of what gives it a chance of letting us make better policies and run our infrastructure more effectively. Without a feedback loop, the decision of what to do always comes down to what sounds better politically, or to other, worse feedback loops like prestige and money, rather than what works better in practice. So I'm going to describe the evaluation system first and defer to later some parts of the system that evaluation depends upon.
A system of evaluation carries inherent risk: it can become a target for ideologues and opportunists who want power. Supreme Court seats and the ownership of major media outlets both offer important lessons here, as they both do something a little like evaluation, and they are both hotly sought after by partisans. Orwell identified the government-approved declaration of Truth as a critical threat to our freedom in society. If government has decided that one direction is True, skeptics become a threat to the peace and civility of society. Meanwhile the postmodernists would like to point out (being charitable to them) that the concept of truth is often used to cement privileged perspectives. Many social truths look very different from less privileged views.
So we have to tackle evaluation up front. A belocratic evaluation system that can't do a fair enough job, isn't sufficiently protected from ideological conquest, or functions to stifle the dissent of the less powerful has the potential to be worse than no evaluation system at all. If belocracy is to be a viable system, it must achieve the twin goals of providing a useful feedback loop and self-correcting against abuse. At the same time, it is easy to allow the perfect to be the enemy of the good. Despite Orwell's warnings of a Ministry of Truth, we still allow a justice system with its apparatus of judgement and determination. Why?
Our justice system allows us to settle disputes better than we could without it, despite being imperfect and not always living up to its own ideals. Critically, it has several features that keep it from becoming the Ministry of Truth, including a culture of independence and neutrality, the appeals process, and careful limits on what is acceptable to review. Politicians have few tools to punish judges for judgements they dispute, and judges believe that they are supposed to be independent and neutral. The appeals process ensures that no one person's judgement is final, and more judges are involved at each higher level. Perhaps most importantly, you can't ask a court to review any topic you want and pass judgement; the justice system views itself as having very strict limits on what is in scope for its role. As a result of these and other factors, the culture of the justice system centers arguments and reasons more often than it does power and politics, and partisans who expect partisan outcomes are often surprised by the actual opinions judges issue.
Feedback cycles succeed when a) the feedback is good and b) the feedback is used to drive a new generation of improvements. At their best, feedback cycles are extremely powerful. Capitalism has generated enormous amounts of profit over the last half millennium. Over the last 3 billion years, evolution has generated an incredibly wide array of novel organisms that survive and procreate. Machine learning is newer than the other two, but has also had remarkable success with "simple" gradient descent.
There's no natural judgement that we can bootstrap on. No god out there sits on a cloud and speaks their determination of whether we've done well with our actions, and there's no wise person we all agree to listen to on the subject. To try to pass evaluation off to a computer program is merely to implicitly trust the programmers. To pass it off to a metric is to demand we agree on the metric first, and then trust the measurers and hope that Goodhart won’t win. We have to use regular humans if we want this system to exist. So the whole of belocracy needs to work together to enable the evaluation system to do well at grading the outcomes of our choices and setting the incentives that induce us to improve in the future.
In economics, price is called a signal because it provides information throughout the economy that allows for distributed decisionmaking. Jeff Bezos likes to say "your margin is my opportunity", by which he means that if someone is delivering a product at high prices, other entrepreneurs analyze what it would cost them to deliver a competing product, and if they can deliver it cheaper, the price draws them into the market. Over time, assuming there aren't other barriers to entering the market such as network effects or IP law, high prices stimulate the competition necessary to drive them down.
In belocracy, evaluations function as a signal in a similar way for society, encouraging citizens to improve things however they can figure out how to improve them. Organizations and policies that succeed receive positive evaluations which earn their creators and employees reputation and monetary rewards. Meanwhile, policies that fail are repealed and are an opportunity for policy designers to create new ones. Organizations which are doing poorly are a target for social entrepreneurs to create new, more effective ones.
Let's dive into how the evaluation system works.
What gets evaluated?
We don't want the belocratic evaluation system to be open to evaluating anything. A justice system that accepted any disagreement would quickly devolve into chaos. An evaluation system open to any question would too. Both would also rapidly violate the small-l liberal goal of limiting the role of government in life. The belocratic evaluation system is only intended to judge the outcomes of the decisions of government. Specifically, evaluation panels review three things to build feedback into the system:
Were the policies we passed effective at improving the problems they were intended to address? Did they impose excessive costs we weren't intending to accept?
How well are the organizations that manage our regulatory, enforcement and societal infrastructure performing?
Which citizen contributions of ideas and information to both the policy design and evaluation process have been most useful?
With policies, we’re determining whether to roll them out further, leave them as they are until something better comes along, or revoke them. With organizations, we’re setting them to compete against each other to do better at their goals for society. With citizen contributions, we’re incentivizing the citizenry to tell us the facts on the ground of society, to come up with ideas of what to do about problems and to find useful arguments and evidence in favor of one path or another.
Who evaluates?
At the heart of the evaluation system are those who evaluate, whom I have named valors. To be selected as a valor you must already have a deep reputation built by researching or crafting successful policies, or by running societal infrastructure effectively. From that pool of people at the very top of service to their country, belocracy runs a SIEVE to select new valors. The voters in the SIEVE can be randomly selected from the top half of society by reputation, but the candidates are randomly selected from the top 5%. It should be rare that anyone makes it into the ranks of valor without a long track record of contributing successfully to making society better. The randomness of the SIEVE keeps this from being a gameable path to power[1], so those whose primary motivation is power are likely to look elsewhere for an easier path. Being rich and famous won't give you a leg up unless you are also great at delivering for society. The way to game the system to gain this prestige is to actually improve things.
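To make the selection mechanics concrete, here is a minimal sketch in Python. It compresses the SIEVE into a two-stage random draw plus a placeholder vote; the real SIEVE is richer, and the pool sizes and scoring rule here are illustrative assumptions, not part of the design.

```python
import random

def run_valor_sieve(population, num_voters=200, num_candidates=20,
                    seats=5, seed=None):
    """Toy model of selecting valors via a SIEVE.

    `population` is a list of (name, reputation) pairs. Per the text:
    voters are drawn at random from the top half by reputation, and
    candidates are drawn at random from the top 5%. The voting rule
    below (noisy scoring of candidates) is purely a placeholder.
    """
    rng = random.Random(seed)
    ranked = sorted(population, key=lambda p: p[1], reverse=True)

    top_half = ranked[: len(ranked) // 2]
    top_five_percent = ranked[: max(seats, len(ranked) // 20)]

    voters = rng.sample(top_half, min(num_voters, len(top_half)))
    candidates = rng.sample(top_five_percent,
                            min(num_candidates, len(top_five_percent)))

    # Placeholder voting rule: each voter scores every candidate; the noise
    # stands in for genuine judgement of the candidate's record of service.
    totals = {name: 0.0 for name, _ in candidates}
    for _ in voters:
        for name, reputation in candidates:
            totals[name] += reputation + rng.gauss(0, 0.25 * max(reputation, 1))

    return sorted(totals, key=totals.get, reverse=True)[:seats]

population = [(f"citizen-{i}", i) for i in range(1000)]
print(run_valor_sieve(population, seed=42))
```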
Valors are selected into panels or teams that work together to evaluate topics within a small set of overlapping subject areas. A panel might be responsible for all programs and policies related to agriculture, or defense, or it might be responsible for a small set of topics across several areas -- for example, technology innovation across defense, transportation and energy. The number of evaluation panels isn't set in stone: a policy can be proposed to add or combine evaluation panels just as any other policy is proposed. Since policies are reviewed by evaluation teams for their success, each policy needs outcomes it's aiming to improve. A proposal to add or combine panels could aim at reducing backlog length and improving the speed and quality of evaluations.
Being effective as a valor requires some depth across a wide variety of domains, since many scientific areas are meaningful for understanding policy. Ensuring that valors always have the option of continuing education is good practice, but isn't enough to keep them sharp. Some form of competition is necessary to keep them from becoming as set in their ways as some old scientists or judges. Among the most important skills for a valor are the ability to see through falsehood and rhetoric and the ability to genuinely understand a perspective outside their own preferred ideology, along with the more obvious mathematical and scientific literacy needed to delve into the truth. The valors as a whole run an Evaluation Olympiad every few years where sitting valors compete at these and other relevant skills. The Olympiad also lets us as a society judge them. Valors who do poorly are smart to retire before they are removed for it.
How do evaluations run?
Evaluation panels plan their dockets based on the policies and programs relevant to their arenas. Policies, programs (organizations) and some petitions can all end up on a panel's docket according to which arenas of society they touch.
When policies are written, they include several things that are unusual in our current legislation. The first is an explicit set of goals: what we're hoping the policy will achieve. The second is an iterative, experimental rollout plan with checkpoints in time, describing how the policy will be implemented in stages. After each stage, the policy is evaluated by an evaluation panel to determine whether the rollout should be continued, paused, modified, or reversed. Policies written without these things are an opportunity for the evaluation panel to make them up -- a dangerous game for policy designers who will be paid based on these evaluations.
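As a structural sketch, a policy under this scheme might carry its goals and staged rollout plan explicitly. The type names and fields below are mine, invented for illustration:

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Optional

class CheckpointDecision(Enum):
    CONTINUE = "continue"
    PAUSE = "pause"
    MODIFY = "modify"
    REVERSE = "reverse"

@dataclass
class RolloutStage:
    description: str                               # e.g. "pilot in one volunteer town"
    checkpoint: date                               # when a panel reviews this stage
    decision: Optional[CheckpointDecision] = None  # filled in at evaluation time

@dataclass
class Policy:
    title: str
    goals: list[str]   # explicit outcomes the policy aims to improve
    stages: list[RolloutStage] = field(default_factory=list)

# A toy instance:
lunch_pilot = Policy(
    title="Expanded school lunch pilot",
    goals=["improve child nutrition metrics", "avoid excessive cost overruns"],
    stages=[RolloutStage("single-district pilot", date(2026, 9, 1))],
)
```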
Evaluation panels have the freedom and flexibility to go beyond what was originally conceived. Policy designers could write in to point out how an expanded food stamp program has improved child health outcomes in ways that weren't considered at the time of passage. Libertarians could point out that the goal of reducing terrorism may have made a lot of sense to the electorate in the aftermath of 9/11, but ten years out seems less important compared to the constant intrusions of the state on our civil liberties. Evaluation panels have a chance to rewrite goals to better serve society when it makes sense to them. If they go too far, they can run afoul of the mechanisms by which evaluation panels are held accountable.
Societally run programs -- infrastructure, regulatory, enforcement, etc. -- are also on a regular schedule of evaluations. Each program is run by a set of organizations called societal benefit organizations, whose structure we'll delve into later. These organizations are in competition with each other to receive high marks in their evaluations, and particularly bad marks can lead to an organization being shut down.
Evaluation panels manage their dockets independently. They can prioritize things sooner or hold onto them for a few months depending on feedback from other panels, researchers and the public. As they set dates for each of these evaluations, this triggers the opportunity for citizens to contribute citizen arguments, equivalent to the amicus curiae briefs that organizations sometimes submit to our courts about topics they care deeply about. In the case of evaluations, citizen arguments are at the heart of the process rather than an addition. There is no plaintiff and no defendant in an evaluation, only a search for truth, which anyone in the country can contribute to. The evaluation panels have their own staff of researchers who investigate topics as well, both to find things independently and to delve deeper into the citizen arguments that are received. Citizen argument is one of the core self-governance features of belocracy: a chance to persuade the evaluation panel and, as a result, the country, to your point of view.
When the panel sets out to evaluate whether a policy is succeeding, they expect citizens to write in with their experiences, academics to write in with their research, and the policy designers themselves to add to the mix. This is a lot of material, and one of the core components of belocracy is a data wrangling system whose job is to highlight the best of it, so that the entire citizenry can contribute while still allowing the valors to actually do their job. We'll cover the belocratic data system in a later chapter. Many of these contributors will be heavily biased, but that's the reality of people. We are all biased in various directions, some easy to notice and some much harder. We are biased by our experiences, our profession, our ideologies, our future hopes and dreams, and the fundamental underlying nature of our brains. Our current system hides the biases of those who provide inputs by encouraging the inputs to happen in backrooms. Belocracy demands these inputs be public so that the biases can be visible and pointed out. The panel's job is to review the best of the commentary that comes in from the belocratic data system and, just as judges must come to a conclusion from what is presented to them, come to a conclusion about how this aspect of governance is performing for us. As with judicial opinions, panels will release a majority opinion, and some panel members may well write dissents. Belocracies can and should require supermajority opinions for some things, such as expanding a controversial experiment or terminating a popular program.
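To make the voting bar concrete -- the specific thresholds below are my assumption, since the text only says that some actions should require a supermajority:

```python
from fractions import Fraction

# Illustrative thresholds only; the design fixes no numbers, just the idea
# that high-stakes actions need more than a bare majority.
THRESHOLDS = {
    "routine_evaluation": Fraction(1, 2),
    "expand_controversial_experiment": Fraction(2, 3),
    "terminate_popular_program": Fraction(2, 3),
}

def opinion_passes(votes_for: int, panel_size: int, action: str) -> bool:
    """Does a panel opinion clear the bar required for this action?"""
    return Fraction(votes_for, panel_size) > THRESHOLDS[action]

assert opinion_passes(5, 9, "routine_evaluation")           # 5/9 > 1/2
assert not opinion_passes(5, 9, "terminate_popular_program") # 5/9 < 2/3
```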
Incentives
As with trial judges, evaluation panels hand out two things separately. With a court it's a verdict and a sentence; with an evaluation panel it's an evaluation and a reward or penalty. Rewards are a pot of money and reputation distributed by belocracy to those who made the success happen. Penalties are a pot of debt and reputational losses distributed to those who drove failing policies. Penalties aren't due immediately, but rather reduce future rewards, to ensure that those who aren't currently wealthy can still contribute. Not everyone who participated in a policy or program deemed to have succeeded or failed receives rewards or penalties -- the money is generally divided among those deemed to have contributed the most. Everyone receives the reputation gain or penalty, though it's scaled to contribution. We'll talk further about reputation in the chapter on the belocratic data system and in a chapter of its own.
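A minimal sketch of that accounting, assuming a single numeric scale for both money and reputation (a simplification; all names here are mine):

```python
from dataclasses import dataclass

@dataclass
class ContributorAccount:
    """One contributor's running money-and-reputation position.

    Per the text, penalties are not due immediately: they accrue as debt
    that offsets future rewards, so contributors without wealth can still
    participate. Both money and reputation are scaled by the contributor's
    share of the work.
    """
    debt: float = 0.0
    reputation: float = 0.0

    def apply_penalty(self, pool: float, share: float) -> None:
        self.debt += pool * share
        self.reputation -= pool * share

    def apply_reward(self, pool: float, share: float) -> float:
        """Returns the cash actually paid out after settling past debt."""
        earned = pool * share
        self.reputation += earned
        settled = min(self.debt, earned)
        self.debt -= settled
        return earned - settled

account = ContributorAccount()
account.apply_penalty(pool=10.0, share=0.5)        # a failed policy: 5.0 in debt
print(account.apply_reward(pool=100.0, share=0.2)) # later success pays 20 - 5 = 15
print(account.reputation)                          # 15.0 net reputation
```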
Success and failure aren't always as straightforward as they might seem. Risk-taking is an essential component of learning, and learning is at the heart of what makes a feedback loop like the evaluation system work. When belocracy tries out well-designed experiments that we learn from rapidly, the researchers, policy designers, juries, and belocrats should be rewarded, whether the outcomes are positive or negative. When they pass policies that cause harms that were predicted but ignored, or pass packages of things that don't go together, or prefer really complicated things to simpler options, they should be penalized. Evaluation panels should start by looking at the benefits that accrued from the policy and the harms that were caused. They should give greater rewards to policies where society clearly thinks the harms are worth the benefits, and where the harms are evenly spread across society or concentrated on those who can tolerate them best. Policies that push harms onto already downtrodden communities and policies that cause excessive harms should be penalized and repealed. And evaluation teams can choose to hand out rewards for well-designed small-scale experiments that demonstrated something valuable, even if they might otherwise be called failures.
Belocracy also encourages risk-taking by scaling penalties and rewards to the size of the impact, naturally creating incentives to do small scale experiments first. Imagine someone who drives the creation and passage of a major policy. They may have needed half a dozen failing experiments to learn essential aspects of the policy landscape before they find the right combination. Each of those small experiments will earn at most a small penalty, but the final successful policy will roll out to everyone and earn rewards at each step of its growth, and the last steps will be much larger. One big hit can make up for hundreds of small failures.
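With made-up numbers, the shape of that incentive looks like this (every figure below is invented for illustration):

```python
# Hypothetical magnitudes, chosen only to show the shape of the incentive:
# penalties and rewards scale with the population a policy actually touched.
failed_pilots = 6
pilot_penalty = 1.0                        # small scale, small downside
rollout_rewards = [1.0, 5.0, 25.0, 125.0]  # reward grows with each expansion step

net = sum(rollout_rewards) - failed_pilots * pilot_penalty
print(net)  # 150.0 -- one scaled-up success dwarfs many small failures
```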
Of course, many policies have tradeoffs, and many do not count as unqualified successes. The question that drives whether a policy is considered a success or a failure is whether it meets its goals well enough that we should keep it in place until we find something better. Setting a bar for this is part of the job of the policy jury which passed the policy, and we'll return to that topic in more detail in another section. An evaluation panel can also scale the reward or penalty based on how successful the policy is. If we get something that is better than nothing, but we are going to be actively looking for a replacement, that might qualify for a smaller reward pool than an obvious win.
Sizing rewards is difficult, especially with potentially many people contributing to a policy. The town that agreed to be an experiment for the policy, the belocrat who oversaw the area, the policy designer who put the policy together, the researchers and citizens who contributed ideas, and the policy jury that decided to take a chance on it: all of these are potential recipients of rewards. Belocracy assigns a set of small societal benefit organizations to maintain the official scales of reference policies and rewards. Perhaps the creation of Social Security is at the high end of enormous successes, and the decision by the FDA to allow telemedicine during the pandemic is a small victory. Each gets a number, and if you wish to dispute the choices of these organizations, you can argue it at their evaluation. With people dedicated to the task and competing over the outcomes, they'll come up with great reference points that make it easy for evaluation panels to hand down consistent rewards.
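One way such a scale might be published and queried -- the reference entries and magnitudes here are invented, and the real scale would be maintained and defended by the dedicated organizations just described:

```python
import bisect

# A published reference scale: (reward size, landmark example). Entries and
# magnitudes are invented for this sketch.
REFERENCE_SCALE = [
    (1, "FDA allowing telemedicine during the pandemic"),
    (10, "a state-level licensing reform"),
    (100, "the creation of Social Security"),
]

def nearest_references(proposed_size: float):
    """Which landmarks should a panel compare a proposed reward against?"""
    sizes = [size for size, _ in REFERENCE_SCALE]
    i = bisect.bisect_left(sizes, proposed_size)
    lo = REFERENCE_SCALE[max(i - 1, 0)]
    hi = REFERENCE_SCALE[min(i, len(REFERENCE_SCALE) - 1)]
    return lo, hi

print(nearest_references(30))
# ((10, 'a state-level licensing reform'), (100, 'the creation of Social Security'))
```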
Panels can run evaluations and decide rewards any way they want, though I recommend they experiment with splitting the panel in two, having one side decide who receives what percentage and the other size the reward or penalty itself. Their internal workings aren't set in stone, in part because different panels will work differently, and in part because valors will, through practice, experiment and explore how best to approach these problems.
This all might seem utopian from the standpoint of our current society. Across the ideological spectrum there is rampant disagreement on whether the policies in place help or hurt more and what we should do about it. I think it's worth comparing to the way we interpret the law. There's often fierce disagreement about how the law should be interpreted, but as courts weigh in, the parts of society that lost mostly grumble and move on. We tolerate those interpretations because they provide the clarity of knowing what to expect. In the rare instances where the courts decide in ways that society overwhelmingly disagrees with, there is a mechanism (Congress) to overturn the courts' decision. Evaluation panels serve the same sense-making purpose: they create a visible determination of what is working, providing reasons and arguments for their opinion.
Their opinions can be appealed, both to other evaluation panels and to a petition jury made up of regular citizens. Ultimately, though, this process has an endpoint, and society either has to accept it and move on, or propose a change that receives broad agreement. This is the nature of self-governance at scale -- we're not all going to be happy all of the time, but more of us get to have a say than in other systems. This is one of the things people mean when they say something is democratic, and here belocracy is democratic in a different way: instead of everyone having a vote, everyone is welcome to contribute their greatest knowledge and ability, in arguments for evaluation panels to consider and in proposals for improvement.
Appeals
So, what happens when someone outside the panel disagrees with an evaluation? What checks and balances prevent panels from making mistakes and abusing their position? It is inevitable that humans will make wrong decisions and come to wrong conclusions, so what matters are society's options for responding when that happens.
Given that evaluation panels resemble courts, it would seem obvious to have a supreme panel to review appeals. The US Supreme Court functions -- at least, it makes decisions which are considered final -- but it also attracts constant criticism for partisanship and bias. The main argument in favor of the Supreme Court's existence is that it prevents situations where two courts disagree and no one decides between them. To steal a term from computer science, we want a system which terminates when it evaluates any specific question, rather than one where the question just bounces eternally from court to court or panel to panel. Before we accept the design of the judiciary with its hierarchical levels of courts, we should ask ourselves whether we can have a system which terminates, but without the hierarchy.
At its best, evaluation should connect directly to the ground truth of how people experience society. Successful policies are ones that make our lives better. Great infrastructural and regulatory organizations make our lives better and protect us from bad outcomes without stifling innovation. For example, I am generally very grateful for the FAA's safety record with airplanes in the US (despite recent Boeing mishaps, 2023 was the safest year on record). Some disagreements between evaluation panels will come down to disagreements about the facts on the ground. Where they disagree about interpretations of whether something is good, they can turn this into a factual question by determining whether the citizenry sees it as good. The remaining difficult case is where the citizenry disagrees internally -- where roughly as many see it as good as see it as bad, or where some specific subgroup sees it as discriminatorily harmful towards them.
Let's consider the last case at a later date, as part of a general discussion of how belocracy deals with controversial topics. For now, let's talk about how evaluation panels take on disagreements over facts, whether facts about policy outcomes or facts about citizen happiness. For this, we turn to science, which is the best tool we have for determining facts. Ideal adversarial scientists who disagree about which evidence to trust or how to interpret it should agree on a single experiment that will convince them both. Adversarial collaboration is the gold standard for resolving particularly thorny scientific disagreements, but it is also difficult interpersonally, so of course such collaborations are rare. We all respond to incentives, and our current system does not demand we rise to meet this bar. Belocracy does. Evaluation panels that disagree on the data submitted to them, or on its interpretation, are required to design an experiment together.
Just as courts accept appeals for a limited set of reasons, evaluation panels generally accept appeals only when you can demonstrate that some evidence was improperly understood, included or excluded. Did the evaluation panel ignore or miss evidence which they should have considered? Did they place too much weight on evidence that you can show to be biased or wrong? Did they misunderstand the statistics of a study? An appeal within the evaluation system is similar in many ways to an ideal peer review or response paper within the scientific community. It focuses on misunderstandings of reality.
If your grounds for appeal seem worth considering, another evaluation panel can reopen the evaluation to review the specific thing you object to. If an evaluation panel accepts an appeal of a ruling from another panel, after their review they send their opinion to the original panel. That panel can either accept the review and updated decision or demand that an experiment be set up. It is the reviewing panel's job to design the experiment, though the original panel must agree to it before the experiment is considered finalized. If they cannot agree, a third panel is randomly chosen to act as moderator on the experimental design.
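Here is that appeal flow as a runnable sketch. The panel objects and their method names (review, accepts, design_experiment, agrees_to, moderate) are invented for illustration; only the control flow comes from the text:

```python
import random

class StubPanel:
    """Stand-in for an evaluation panel; behavior is hard-coded for the demo."""
    def __init__(self, name, accepts=True, agrees=True):
        self.name, self._accepts, self._agrees = name, accepts, agrees
    def review(self, appeal):             return f"{self.name}'s opinion on {appeal!r}"
    def accepts(self, opinion):           return self._accepts
    def design_experiment(self, opinion): return f"experiment testing {opinion}"
    def agrees_to(self, experiment):      return self._agrees
    def moderate(self, experiment):       return f"{experiment} (moderated by {self.name})"

def resolve_appeal(appeal, original, reviewer, other_panels, rng=random):
    opinion = reviewer.review(appeal)
    if original.accepts(opinion):
        return ("updated decision", opinion)
    # The original panel demands an experiment; the reviewing panel designs
    # it, but the original must sign off before it is finalized.
    experiment = reviewer.design_experiment(opinion)
    if original.agrees_to(experiment):
        return ("run experiment", experiment)
    # Deadlock: a randomly chosen third panel moderates the design.
    moderator = rng.choice(other_panels)
    return ("run experiment", moderator.moderate(experiment))

outcome = resolve_appeal("misread statistics",
                         original=StubPanel("agriculture", accepts=False),
                         reviewer=StubPanel("defense"),
                         other_panels=[StubPanel("energy")])
print(outcome)
```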
A Supreme Panel could also serve as the path of appeals, staffed with those best able to understand science and statistics. We could pick Supreme Panelists by running a SIEVE with current and former valors as the candidate pool. Hierarchical and non-hierarchical belocratic evaluations are likely to be close to equivalent. I prefer non-hierarchical because it reduces the number of people we tell hero (or villain) stories about, and better matches the ever-evolving nature of scientific consensus, but hierarchical is perhaps simpler for many people to understand and reason about. A hierarchical panel could also respond to disagreement with a lower panel by setting up an experiment, so the core idea works in both situations.
In plenty of cases, the panels may all agree on the interpretation (or not want to accept the work of having to design an experiment) and the citizen may still be unsatisfied. This is almost guaranteed to happen given how often we see fringe theory believers disagreeing with mainstream consensus. Sometimes, however, the mainstream consensus of those in positions of power serves them at the expense of the citizenry at large. For belocracy to qualify as self-government, there must be an opportunity for accountability when the citizenry disagrees with the decisions of those who have been selected to serve them. We'll discuss the details of petitions in a later chapter, but briefly: if a citizen can corral enough agreement in support of a petition that the evaluation panels are incorrect, their question will be reviewed by a petition jury selected from the citizenry at large. The evaluation panel explains its reasoning, as does any citizen who wishes to, and the petition jury is the final arbiter. This ensures that the citizenry at large have a final option to reject the work of the specialists (the valors) whom they have entrusted with these critical decisions. The bar to raise a petition has to be high enough to prevent abuse, but low enough to be a real check on the power of these positions.
Who (e)valuates the Valors?
A design principle of belocracy is to build feedback loops on decisions that matter. Evaluations are decisions that matter, so at first glance, it would seem that reviewing them and incentivizing high quality decisions would be in line with this design principle. It's possible to do this with a SIEVE-selected jury of citizens, but I suspect that reviewing evaluations like this isn't going to work out well enough to be worth the effort. It may be something that a belocratic system can experiment with, but I don't think it's essential.
What is essential is that valors can be removed -- for corruption, bad decisions, or incompetence -- and that failing evaluation panels can be terminated to make way for new ones. This provides a feedback mechanism for the worst cases and a check on abuse and corruption.
Valors can be removed in several ways. I mentioned the petition system above, and it is also open to citizens who have discovered corruption or incompetence and want to campaign against it directly. Corrupt valors won't last long, whether they are direct abusers of power or influence peddlers, since the person who initiates a successful petition against them will receive quite a bit of reputation for correctly pointing it out. Incompetence can be harder to be certain about, but is likely to also get them removed. We'll talk about petitions more in a later chapter.
I mentioned that valors demonstrate their skills to each other on a regular basis. Those who do poorly in the rankings are likely to be encouraged to retire, as they are both potential petition targets and potentially the cause of a panel being terminated.
The petition system is also the mechanism citizens can use to target poor panels for termination, whether they're making bad decisions or just mediocre ones. If the petitioner can show a pattern of poor decisionmaking, petition juries should default to shutting down the panel and initiating a new one. As we'll discuss further when we talk about societal benefit organizations, our goal should always be an ecosystem that actively encourages innovation, and stagnant organizations need to die to make room for people to create replacements. A new panel chosen by SIEVE will take on the task of re-evaluating any decisions that the petition jury deemed particularly egregious, and will then aim to do better than its predecessors.
Lastly, there is a role we haven't talked about yet, titled the belocrat. The belocrats together can impeach and remove a valor, just as they can remove another belocrat or, really, anyone in any key role in society. This is a last resort which is unlikely to be used frequently, equivalent to successful impeachments of Presidents and Supreme Court justices.
In the next chapter we’ll dive into the belocratic data system whose job is to generate and surface ideas and interesting pieces of evidence.
If you find my work interesting, please share and subscribe. It helps tremendously.
[1] In general, belocracy selects people via SIEVE and reputation minimums to prevent itself from turning into the prestige-laundering credentialist cocktail hour we have now.