Yep. Building a project that needs some LLMs. I'm very much of the self-hosting mindset so will try DIY, but it's very obviously the wrong choice by any reasonable metric.
OpenAI will murder my solution by quality, by availability, by reliability and by scalability...all for the price of a coffee.
It's a personal project though & partly intended for learning purposes so there is scope for accepting trainwreck level tradeoffs.
No idea how commercial projects are justifying this though.
Their data retention policy on their APIs is 30 days, and it's not used for training [0]. In addition, qualifying use cases (likely the ones you mentioned) qualify for zero data retention for most endpoints.
In sensitive cases you do not think about the normal policy, you think about the worst case. You just can't afford a leak. Your local installation may be much better protected than a public service, by technology and by policy.
For years people have essentially made a living off FUD like "ignore the literal legal agreement and imagine all the worst case scenarios!!!" to justify absolutely farcical on-premise deployments of a lot of software, but AI is starting to ruin the grift.
There are some cases where you really can't afford to send Microsoft data for their OpenAI offering... but there are a lot more where some figurehead solidified their power by insisting the company build less secure versions of public offerings instead of letting their "gold" go to a 3rd party provider.
As AI starts to look like a competitive advantage, with the SOTA of self-hosted lagging so ridiculously far behind, you're seeing that work less and less. Take Harvey.ai for example: it's a frankly non-functional product, and despite being OpenAI based it still manages to spook top law firms, with tech policies that have been entrenched for decades, into paying money on the simple chance they might get outcompeted otherwise.
Gah, this is just not how it works. You are probably right that e.g. patient information, private conversations, proprietary code, etc would be safe with OpenAI. But it's not the on-prem team that needs to convince the rest of the organization to keep things on prem. Quite the opposite -- every single tech person would love to make our data someone else's problem (and get a big career boost from dealing with cloud tech instead of the dead-end that is local sysadmin!).
But you just can't. You cannot trust the scrappy startup OpenAI. You can't even trust Microsoft's normal cloud offering, because the people who actually give a fuck about the risk NEED to have granular detail of what data, readable by whom, is stored exactly where and for how long, and how can you make sure, and how do you know that access is scoped to the absolute minimum number of people, and is there a paper trail for that?
For these "figureheads": the buck, stopping, here, etc.
> because the people who actually give a fuck about the risk NEED to have granular detail of what data, readable by whom, is stored exactly where and for how long, and how can you make sure, and how do you know that access is scoped to the absolute minimum number of people, and is there a paper trail for that?
You realize that Microsoft is a publicly traded company that has multiple privacy certifications? They are subject to detailed data ownership and consumption audits. They most definitely have data ownership and retention logs, and you can request copies of their certification audits to understand how they log/track this. I think the parent is being too optimistic, but your answer is so comically simplistic it's silly. I highly suggest you read about the world of HIPAA, PCI, and FedRAMP instead of just thinking "omg the data".
Absolutely. I am well aware of this, as are most tech people, that's what I'm saying. It's not us that are trying to convince our orgs to build a rack in the basement "to be more secure" just because we want to hear the fans running and see the lights blinking.
It's the lawyers that you need to convince. Good luck convincing any bigco lawyer that your company's data is safe on OpenAI because their legal agreement says "we don't train on API calls."
Vendor risk management is a thing, and plenty of companies that work with medical data or legal data or financial data or sensitive government data are, in fact, able to store that data with their vendors. This is a thing that happens all the time by literally all of the companies.
This nonsense that you can't trust anyone with your data is completely unfounded.
You're not saying anything counter to what I said.
> You cannot trust the scrappy startup OpenAI
Not saying you do: Azure has a dedicated capacity driven GPT-4/3.5 offering that you can stick in your VPC with everything from PCI to HITRUST certs. These are the things that come out if you actually care about delivering solutions vs jumping to deliver the right sounding words for the figureheads like "We'd never trust those scrappy OpenAI guys!!!!"
> Quite the opposite -- every single tech person would love to make our data someone else's problem (and get a big career boost from dealing with cloud tech instead of the dead-end that is local sysadmin!).
You're attracting the least equipped people who tumbled into what you just admitted is a dead end trajectory, usually paying below market rates as a result, and then expecting them to outperform the people paying the most money for competent security outlays with much bigger fish (Azure is working with teams that need FedRAMP, DoD certs, HIPAA compliance, and much more).
The end result is that you end up with a poorly maintained leak sieve of an infrastructure in which Azure would likely be the most secure component you have to lean on in your entire organization.
You say:
> because the people who actually give a fuck about the risk NEED to have granular detail of what data, readable by whom, is stored exactly where and for how long, and how can you make sure, and how do you know that access is scoped to the absolute minimum number of people, and is there a paper trail for that
They don't care about risk, they care about flawed perceptions of risk that don't align with reality. These are the same companies that get pwned for years through some basic social engineering, and all that they ever have to show for it is audit logs that show who ac... ah wait no one ever actually checked the logs and it turns out they're useless because subsystem X Y and Z aren't even connected to it.
The wording is “not be used to train or improve OpenAI models”, which doesn't mean it's not used to get knowledge about your prompts or your business use case. In fact, the wording of the policy is loose enough that they could train a policy model on it (just not the LLM itself).
They have a SOC 2, so literally an (external) auditor has looked at their data retention policies, and if your business is a customer you should request access to the report.
But does that matter legally for the health, finance and legal sectors? I am not familiar with the laws themselves, but I worked in finance for a long time and the internal rules were that sensitive data cannot move off premises, no matter what the external party promised or had certified.
Yes, there are certifications for each of those sectors. Finance has PCI compliance, health has HIPAA.
For legal issues it's a bit more nuanced (e.g. New York State has guidelines about best practices, but they're honestly fairly sensible and would probably allow SOC 2 or equivalent).
The best thing anybody can do with your data is not store it for very long. Beyond that, they should take sensible measures, like encrypting it at rest, having policies restricting access, etc.
For basic usage, you can get away with a small graphics card or no graphics card at all (though it will be very slow).
The general rule of thumb is: take the model size in billions of parameters (7B, 13B, 34B, 70B) and multiply it by 0.5 for 4-bit quantization or by 0.625 for 5-bit quantization. If the result, read as gigabytes, is smaller than the combined amount of system RAM and VRAM in your system, you can run the model at that quantization.
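To make that rule of thumb concrete, here is a rough sketch in Python. The function names and the example machine (32 GB RAM plus 8 GB VRAM) are made up for illustration, and it only counts the weights, so context/KV cache overhead is ignored and real headroom will be smaller.

```python
# Rough sketch of the rule of thumb: weight footprint in GB is roughly
# (billions of parameters) * (bytes per parameter at a given quantization).
# All names here are hypothetical, not from any library.

BYTES_PER_PARAM = {4: 0.5, 5: 0.625}  # 4-bit = 4/8 byte, 5-bit = 5/8 byte


def quantized_size_gb(params_billions: float, bits: int) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billions * BYTES_PER_PARAM[bits]


def fits(params_billions: float, bits: int, ram_gb: float, vram_gb: float) -> bool:
    """True if the weights fit in combined system RAM + VRAM."""
    return quantized_size_gb(params_billions, bits) < ram_gb + vram_gb


if __name__ == "__main__":
    ram_gb, vram_gb = 32, 8  # assumed example machine
    for size in (7, 13, 34, 70):
        for bits in (4, 5):
            need = quantized_size_gb(size, bits)
            verdict = "fits" if fits(size, bits, ram_gb, vram_gb) else "too big"
            print(f"{size}B @ {bits}-bit: ~{need:.2f} GB -> {verdict}")
```

Anything that only fits by spilling into system RAM will still run, just slowly, as mentioned above.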
A jacked up PC can do really well and there is much fun to be had there.
...but you'd struggle to get close to even GPT 3.5 let alone 4 for generic tasks.
For custom tunes...yeah sure, custom rolls will beat generic OpenAI. But that's a bit like pitting custom-tuned cars against street-legal manufacturer cars. It's an apples-to-oranges comparison.
A lot of tools for constraint, creativity, and related tasks rely on manipulating the entire log probability distribution. OpenAI won’t expose this information and is therefore shockingly uncompetitive on things like poetry generation.
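As a toy illustration of what "manipulating the log probability distribution" can mean, here is a minimal sketch of one such trick: banning some tokens and renormalizing what is left before sampling. The function name and the tiny vocabulary are invented for the example; a real setup would do this over the model's full vocabulary at every decoding step, which is exactly the part you need local access for.

```python
import math
from typing import Dict, Set


def constrain_logprobs(logprobs: Dict[str, float], banned: Set[str]) -> Dict[str, float]:
    """Drop banned tokens and renormalize the remaining log-probabilities."""
    kept = {tok: lp for tok, lp in logprobs.items() if tok not in banned}
    if not kept:
        raise ValueError("every candidate token was banned")
    # log-sum-exp, shifted by the max for numerical stability
    top = max(kept.values())
    log_z = top + math.log(sum(math.exp(lp - top) for lp in kept.values()))
    return {tok: lp - log_z for tok, lp in kept.items()}


if __name__ == "__main__":
    # Hypothetical next-token distribution; the constraint is "no 'e' allowed".
    next_token = {
        "the": math.log(0.5),
        "a": math.log(0.3),
        "moon": math.log(0.2),
    }
    banned = {tok for tok in next_token if "e" in tok}
    for tok, lp in constrain_logprobs(next_token, banned).items():
        print(tok, round(math.exp(lp), 3))
```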
Maybe this has improved, but a few months ago, OpenAI p99 latency was much worse than a self-hosted solution, which would be a problem in certain cases.