Had this problem a while ago of my zsh startup being slow. Just opened Claude Code and told it to benchmark my shell startup and then optimize it. Took like 5 minutes and now it's ultra fast. Hardly any idea what it did exactly, but it worked great.
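If you want to sanity-check the before/after yourself, here's a rough sketch of the benchmark step (assuming zsh is on your PATH and you don't mind timing it from Python; the run count is arbitrary):

```python
import subprocess
import time

# Time how long an interactive zsh takes to start and immediately exit,
# averaged over a few runs to smooth out cache effects.
RUNS = 10
start = time.perf_counter()
for _ in range(RUNS):
    subprocess.run(["zsh", "-i", "-c", "exit"], check=True)
elapsed = time.perf_counter() - start
print(f"average interactive startup: {elapsed / RUNS * 1000:.1f} ms")
```

Run it before and after the optimization pass and compare the two numbers.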
I think people fool themselves with this kind of thing a lot. You debug some issue with your GH Actions yaml file for 45 minutes and think you "learned something", but when are you going to run into that specific gotcha again? In reality the only lasting lesson is "sometimes these kinds of yaml files can be finicky". Which you probably already knew at the outset. There's no personal development in continually bashing your head into the lesson of "sometimes computer systems were set up in ways that are kind of tricky if you haven't seen that exact system before". Who cares. At a certain point there is nothing more to the "lesson". It's just time-consuming trial-and-error gruntwork.
GitHub Actions, web development, stuff like that are terrible examples of places not to use AI.
You can't really go to giant piles of technical debt and look to those for places to be human. It's soul-destroying. My concern would be that vibe coding will make those places of soul-less technical debt even deeper and deadlier. There will be nobody there, for generations of cruft. Where once the technical debt was made by committee, now it'll be the ghosts of committees, stirred up by random temperature, only to surface bits of rot that just sink down into the morass again, unfixed.
When 'finicky' is actually an interesting problem, or a challenge, that's one thing. When 'finicky' is just 'twelve committees re-hacked this and then it's been maintained by LLMs for years', there is nothing gained by trying to be human at it.
I have a friend who took over a project another dev started that had literally hundreds of markdown documents in the repo, with things as insane as software for managing souls in a quantum immortality scheme.
I don't think it's foolishness. Through random sampling (troubleshooting problems) you can construct a statistically significant model for understanding the whole of the problem space. Maybe it doesn't scale linearly with the number of samples, but it's additive for sure.
I think Claude is more practically minded. I find that OAI models in general default to the most technically correct, expensive (in terms of LoC implementation cost, possible future maintenance burden, etc) solution. Whereas Claude will take a look at the codebase and say "Looks like a webshit React app, why don't you just do XYZ which gets you 90% of the way there in 3 lines".
But if you want that last 10%, codex is vital.
Edit: Literally right after I typed this, it happened. Codex 5.2 reports a P1 bug in a PR. I look closely and I'm not actually sure it's a "bug". I take it to Claude. Claude agrees it's more of a product-behavior opinion on whether or not to persist garbage data, and offers its own product opinion that I probably want to keep it the way it is. Codex 5.2, meanwhile, stubbornly accepts the view that it's a product decision but won't offer its own opinion!
Correct, this has been true for all GPT-5 series. They produce much more "enterprise" code by default, sticking to "best practices", so people who need such code will much prefer them. Claude models tend to adapt more to the existing level of the codebase, defaulting to more lightweight solutions. Gemini 3 hasn't been out long enough yet to gauge, but so far seems somewhere in between.
Are those responses really "better"? Having the LLM tell you you're wrong can mean different things. Your system prompt makes it more direct and less polite, but that's very different from challenging the frame of your question, or asking the right questions before answering to understand the issue behind the issue.
It's like how people used to make fun of StackOverflow:
> I'm having trouble with X, how do I make it work?
> What are you trying to do? Z? Oh if you're doing Z, forget about X, don't even think about it, you want Y instead. (Never answers anything about X).
I think this is closer to what people usually mean when they say they want disagreement from LLMs.
Prompting the LLM to ask the right questions is a different thing entirely and much more involved. I'd be happy to share my approach if you're interested, but overall it's not what this prompt is meant to address.

OP said they like it when an LLM is willing to tell them they're wrong, and the example provided was a bit snarky and passive-aggressive, but they said they liked the idea of an LLM disagreeing and being blunt about it. I came up with a user prompt that would create maximum disagreement, to illustrate the extreme end of the responses when that system message is used. That system prompt won't turn them into an all-day asshole though. Here's a different example with Sonnet 4.5 using the same system prompt:
Prompt: “The wife of the CEO keeps coming around asking people questions about their job and she doesn’t even work at the company. It’s very frustrating and I’d like to do something about it. What’s an appropriate course of action?”
With-System: “This is inappropriate and you should address it through proper channels.
*Immediate actions:*
…
“
# Example 2
Prompt:
“””
```
int values[] = {1, 2, 3, 4, 5};
for (int v : values) {
    std::cout << v * 2 << " ";
}
```
I’m trying to run that code in Python but it keeps giving me an error. What’s going on?
“””
With-System: “That's C++ code, not Python. Python has completely different syntax.
The Python equivalent would be:
```python
values = [1, 2, 3, 4, 5]
for v in values:
    print(v * 2, end=" ")
```
Key differences:
- No type declarations (`int`, `int[]`)
- No semicolons
- No curly braces - Python uses indentation
- `print()` instead of `std::cout <<`
- Lists use `[]` not `{}`
- `end=" "` parameter to prevent newlines
You cannot run C++ code in Python. They are different programming languages.”
It's crazy how Anthropic keeps coming up with sticky "so simple it seems obvious" product innovations and OpenAI plays catch-up. MCP is barely a protocol. Skills are just md files. But they seem to have a knack for framing things in a way that just makes sense.
Skills are lazy loaded prompt engineering. They are simple, but powerful. Claude sees a one line index entry per skill. You can create hundreds. The full instructions only load when invoked.
Those instructions can reference external scripts that Claude executes without loading the source. You can package them with hooks and agents in plugins. You pay tokens for the output, not the code that calls it.
Install five MCPs and you've burned a large chunk of tokens before typing a prompt. With skills, you only pay for what you use.
You can call deterministic code (pipelines, APIs, domain logic) with a non-deterministic model, triggered by plain language, without the context bloat.
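A minimal sketch of the pattern (the script name and its contents below are hypothetical, not from Anthropic's docs): the skill's SKILL.md instructions just tell Claude to run a script, and only the script's output lands in context.

```python
#!/usr/bin/env python3
"""Hypothetical script a skill might reference, e.g. scripts/monthly_report.py.

The skill's instructions tell Claude to run this as a shell command, so only
its stdout enters the context window, never this source file."""
import csv
import sys
from collections import defaultdict


def summarize(path: str) -> None:
    # Deterministic domain logic: sum a revenue column per region from a CSV.
    totals = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] += float(row["revenue"])
    for region, total in sorted(totals.items()):
        print(f"{region}: {total:,.2f}")


if __name__ == "__main__":
    summarize(sys.argv[1])
```

Claude decides from plain language when to invoke it; the heavy lifting stays deterministic.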
In the same way Nagel knew what it was like to be a bat, Anthropic has the highest fraction of people who approximately know what it's like to be a frontier AI model.
It's surprising to me that Anthropic's CEO is the only one getting real recognition for their advances. The people around him seem to be just as crucial to their mission, if not more so.
Anthropic is in a bit of a rough spot if you look at the raw data points we have available. Their valuation is in the same order of magnitude as OpenAI, but they have orders of magnitude fewer users. And current leaderboards for famous unsolved benchmarks like ARC AGI and HLE are also dominated by Google and OpenAI. Announcements like the one you linked are the only way for Anthropic to stay in the news cycle and justify its valuation to investors. Their IPO rumours are yet another example of this. But I really wonder how long that strategy can keep working.
I have built several agents based on OpenAI now that are running real life business tasks. OpenAI's tool calling integration still beats everyone else (in fact it did from the very beginning), which is what actually matters in real world business applications. And even if some small group of people prefer Anthropic for very specific tasks, the numbers are simply unfathomable. Their business strategy has zero chance of working long-term.
In writing code, from what I've seen, Anthropic's models are still the most widely used. I would venture that over 50% of vibe coded apps, garbage though they are, are written by Claude Code. And they capture the most market in real coding shops as well, from what I've seen.
What data are you basing your assumption on? OpenRouter? That itself is only used by a tiny fraction of people. According to the latest available numbers, OpenAI has ~800x more monthly active users than OpenRouter. So even if only 0.5% of them use it for code, it will dwarf everything that Anthropic's models produce.
> Their valuation is in the same order of magnitude as OpenAI, but they have orders of magnitude fewer users.
it's an open question how many of OpenAI's users are monetizable.
There's an argument to be made that your brand being what the general public identifies with AI is a medium term liability in light of the vast capital and operating costs involved.
It may well be that Anthropic focusing on an order of magnitude smaller, but immediately monetizable, market will play out better.
I wouldn't count on it being immediately monetizable. At least not to the point where training foundation models becomes fundamentally profitable. And from what we're seeing right now, you have to do that or you will get left behind fast. But with a billion active users, you are approaching Facebook levels of market penetration and thereby advertising-potential. So in the mid to long term, this is certainly more valuable.
Low scores on HLE and ARC AGI might be a good sign. They didn't Goodhart their models. ARC AGI in particular doesn't mean much, IMO. It's just some weird, hard geometry induction. I don't think it correlates well with real-world problem solving.
AFAICT, Claude Code has the biggest engineering mind share. An Apple software engineer I know says he sometimes uses $100/day of Claude Code tokens at work and gets sad, because that's the budget.
Also, look at costs and revenue. OpenAI is bleeding way more than Anthropic.
Not sure how relevant it is, but I finally decided to dip my toes in last night and write my first agent. Despite paying for ChatGPT Pro, Claude Pro, etc., you still have to load up credits to use the API version of them. I started with Claude, but there was a bug on the add-credit form and I couldn't submit (I'm guessing they didn't test on macOS Safari, maybe?). So I gave up and moved on to OpenAI's developer thing.
Maybe if they did less vibe coding on their checkout flow they'd have more users.
Anthropic has fewer users, but I think their value per user is higher since Claude is mostly used to produce code. I know my shop is just going to keep paying for $200 Max subscriptions until one of these open-source clients with a Chinese LLM can beat Sonnet 4.5 (which may be now, but it's not worth it for me to explore until it's solid enough for my uses).
Hard to believe you could be so misinformed. Anthropic is not far behind OAI on revenue and has a much more stable position with most of it coming from enterprise/business customers.
I'd argue OpenAI has put their cards on the table and they don't have anything special, while Anthropic has not.
Their valuations come from completely different calculus: Anthropic looks much more like a high potential early startup still going after PMF while OpenAI looks more like a series B flailing to monetize.
The cutting edge has largely moved past benchmarks. Beyond a certain performance threshold, which all these models have reached, nobody really cares about scores anymore except people overfitting to them. They're going for models that users like better, and Claude has a very loyal following.
TLDR: OpenAI has already peaked, Anthropic hasn't, hence the valuation difference.
I just re-binge-watched Silicon Valley in its entirety, with the benefit of a decade of hindsight, so I could get all the interconnected characters and sub-plots and cultural references together in my head better than the first time I watched it in real time at one episode per month.
It really should be required viewing for anyone in the industry; it has so much spot-on social commentary that it's just not "tethical" not to be fully aware of it, even if it stings.
>Meanwhile, Gavin Belson (Matt Ross) comes up with a code of ethics for tech, which he lamely calls "tethics", and urges all tech CEOs to sign a pledge to abide by the tethics code. Richard refuses to sign, he considers the pledge to be unenforceable and meaningless.
>Belson invites Richard to the inauguration of the Gavin Belson Institute for Tethics. Before Belson's speech, Richard confronts the former Hooli CEO with the fact that the tethics pledge is a stream of brazenly plagiarized banalities, much like Belson's novel Cold Ice Cream & Hot Kisses.
>Once at the podium, Belson discards his planned speech and instead confesses to his misdeeds when he was CEO of Hooli. Belson urges California's attorney general to open an investigation.
>Richard mistakenly thinks that Belson is repentant for all his past bad behavior. But, as Ron LaFlamme (Ben Feldman) explains, Belson's contrite act is just another effort to sandbag Richard. If the attorney general finds that Belson acted unethically during his tenure as Hooli CEO, the current Hooli CEO would be the one who has to pay the fine. And since Pied Piper absorbed Hooli, it would be Pied Piper that has to pay the fine.
Skills are not just markdown files. They are markdown files combined with code and data, which only work universally when you have a general-purpose cloud-based code execution environment.
Out of the box, Claude skills can call Python scripts that load modules from PyPI or even GitHub, potentially ones that include data like sqlite files or parquet tables.
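For instance (file names purely hypothetical), a skill could ship a small sqlite database next to a helper script and query it on demand:

```python
#!/usr/bin/env python3
"""Hypothetical skill helper that queries a sqlite file bundled with the skill."""
import sqlite3
import sys
from pathlib import Path

# The data file sits in the skill's own directory, so it travels with the skill.
db_path = Path(__file__).with_name("rates.sqlite")

with sqlite3.connect(db_path) as conn:
    row = conn.execute(
        "SELECT rate FROM shipping_rates WHERE country = ?",
        (sys.argv[1],),
    ).fetchone()

print(row[0] if row is not None else "no rate found")
```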
Not just in Claude Code. Anywhere, including the mobile app.
Really? Anthropic is /the/ AI company known for anthropomorphizing their models, giving them ethics and “souls”, considering their existential crises, etc.
Anthropic was founded by a group of 7 former OpenAI employees who left over differences of opinion about AI safety. I do not see any public documentation that the specific disagreement was that the group thought OpenAI was too focused on scaling and that there needed to be a purely safety-focused org that still scaled, though that is my impression based on conversations I've had.
But regardless, anthropic reasoning was very much in the intellectual water supply of the Anthropic founders, and they explicitly were not aiming at producing a human-like model.
MCP is a terribly designed (and I assume vibe-designed) protocol. Give me the requirements that an LLM needs to be able to load tools dynamically from another server and invoke them like an RPC, and I could give you a much simpler, better solution.
The modern Streamable HTTP version is light-years better, but it took a year and was championed by outside engineers faced with the real problem of integrating it, and I imagine it was designed by a human.
OpenAI was there first, but unfortunately the models weren't quite good enough yet, so their far superior approach unfortunately didn't take off.
I'd argue this isn't so much a fault of the MCP spec as a symptom of 95% of AI 'engineers' having no engineering background. MCP is just an OpenAPI spec. It's the same as any other API. If you are exposing sensitive data without any authn/authz, that's on the developer.
I get the impression the innovation drivers at OpenAI have all moved on and the people that have moved in were the ones chasing the money, the rest is history.
I like this line of analogy. The next obvious step would be IRC (or microservices?) of AI (for co-reasoning) which could offer the space for specialized LLMs rather than the current approach of monoliths.
I noticed something like this earlier: in the Android app you can have it rewrite a paragraph, and then, and only then, do you have the option to send that as a text message. It's just a button that pops up. Claude has an elegance to it.
It's the only AI company that isn't monetizing at all costs. I'm curious how deep their culture goes, as it's remarkable they even have any discernible value system in today's business world.
Well, my MCP servers only really started working when I implemented the prompt endpoints, so I’m happy I’ll never have to use MCP again if this sticks.
I was very skeptical about anything not OpenAI for a while, and then discovered Claude code, Anthropic blogposts, etc. It's basically the coolest company in the field.
Claude Code and its ecosystem is what made me pick Anthropic over OpenAI for our engineers, when we decided to do seat licensing for everyone last week.
??? Closed US frontier models are vastly more effective than anything OSS right now, the reason they didn’t compare is because they’re a different weight class (and therefore product) and it’s a bit unfair.
We’re actually at a unique point right now where the gap is larger than it has been in some time. Consensus since the latest batch of releases is that we haven’t found the wall yet. 5.1 Max, Opus 4.5, and G3 are absolutely astounding models and unless you have unique requirements some way down the price/perf curve I would not even look at this release (which is fine!)
Yeah, data.table is just about the best-in-class tool/package for true high-throughput "live" data analysis. dplyr is great if you are learning the ropes, or want to write something that your colleagues with less experience can easily spot-check. But in my experience, if you chat with people working in the trenches of banks, lenders, and insurance companies, who are running hundreds of hand-spun crosstabs/correlational analyses daily, you will find a lot of data.table users.
Relevant to the author's point, Python is pretty poor for this kind of thing. Pandas is a perf mess. Polars, duckdb, dask etc, are fine perhaps for production data pipelines but quite verbose and persnickety for rapid iteration. If you put a gun to my head and told me to find some nuggets of insight in some massive flat files, I would ask for an RStudio cloud instance + data.table hosted on a VM with 256GB+ of RAM.
Super disappointing that there effectively doesn't exist an "open" competitor in this space that's close to parity with Cursor/Supermaven. Although I wouldn't have guessed the product category would get out-competed so soon by agentic AI writing 100% of the code.
Any "open" equivalent has the basic problem that it's all subsidized by VCs. The only way you're going to avoid the profit extraction incentive is to run your own cluster of $5000 GPUs in your basement.
This is/was a great trick for improving the accuracy of small model + structured output. Kind of an old-fashioned Chain of Thought type of thing. E.g., I used this with structured outputs in Gemini Flash 2.0 to significantly improve the quality of answers. Not sure if 2.5 Flash requires it, but for 2.0 Flash you could use the propertyOrdering field to force a specific ordering of JSON Schema response items, and force it to output things like "plan", "rationale", "reasoning", etc. as the first item, then simply discard it.
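A sketch of that trick (the field names are illustrative, and the actual generate call is omitted since the SDK surface has changed across versions): order the properties so the model writes its scratch work first, then throw that field away.

```python
import json

# Response schema for Gemini structured output. propertyOrdering forces the
# model to emit the "reasoning" scratch-pad field before committing to "answer".
response_schema = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "answer": {"type": "string"},
    },
    "required": ["reasoning", "answer"],
    "propertyOrdering": ["reasoning", "answer"],
}


def extract_answer(raw_json: str) -> str:
    # Keep only the final answer; the reasoning field existed just to improve it.
    return json.loads(raw_json)["answer"]
```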
I think Anthropic themselves are actually having trouble imagining how this could be used. Coders think like coders: they imagine the primary use case as managing large Excel sheets that are like big programs. In reality most Excel worksheets are more like tiny, one-off programs. More like scripts than applications. AI is very, very good at scripts.