add an ML Model type to schema.org, likely under /Dataset, ... and potentially other major types (SingleTabularDataset, MultipleTabularDataset?) #3140
I concur that "dataset" in schema.org is vague, and that an MLModel is not a dataset but an abstraction of patterns found in a data stimulus. The term "dataset" is also hard to understand (or often misunderstood) in the Dublin Core world as well. From an archivist's perspective (even a digital archivist's perspective), I'd like to point out there is a difference between a collection and a dataset. A collection is a set of items (files) associated for some reason, whereas a dataset is curated, arranged, and designed to be read by a computer as a coherent data structure. This applies to things like a collection of …
A Dataset is something other than a machine-learned model. I also agree there is something missing in schema.org for "model". I would suggest, however, to introduce Model instead of the more specific MLModel. Or perhaps both, with MLModel subclassing Model. A Model is broader and can be a 3D model of a molecule (which is a fit of real-world data) or a model describing some phenomenon, like a pathway model.
We also have https://schema.org/ProductModel and https://schema.org/3DModel. The latter is close, but is defined as a MediaObject.
Ah, nice! 3DModel will work well for chemistry. I'd still like to see that Model superclass :)
Hi! I build @huggingface, a platform where the machine learning community can host and collaborate on machine learning models and datasets. I support the addition of an MLModel type. Potential data we could encode in such a type (some of which we already validate in a structured way on our side):
@julien-c that is great! Thanks for jumping in. Some of that ought to be expressible already. E.g. we have fileFormat, and we already have SoftwareSourceCode and SoftwareApplication, the former giving us runtimePlatform and codeRepository properties. For the more advanced information listed here, I'd hope to identify some effort to use/consume that data, and to find cases where multiple model-hosting sites have the same information.

My colleague Natasha Noy and I have been debating whether it is counter-productive to call ML models "datasets" in some general sense, and how to deal with this. We agree that in the ML context that is confusing terminology, whereas Schema.org has a much more expansive notion of "dataset" (covering non-quantitative research artifacts etc.), so it is awkward to exclude them.

One option here could be some kind of analogy with the way we (and DCAT) represent dataset distributions as "DataDownload", i.e. very specific, concrete bundles of bytes, represented in the MediaObject type hierarchy (alongside ImageObject, VideoObject). If we were to put MLModel there, it would need some convention/property for linking it from Dataset. Different versions of a model (e.g. smaller for mobile on-device use, or older, ...) would be different MLModel media objects, whereas if we stick with the idea of MLModel as a subtype of Dataset, then a single dataset could be an umbrella construct that referenced such versions as DataDownloads.

I believe that either MLModel as a subtype of Dataset, or MLModel as a subtype of MediaObject, could be made to work. I suggest we proceed by putting something basic into the next release of Schema.org as a foundation to explore from...
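To make the two layouts concrete, here is a rough JSON-LD sketch of both. Note that MLModel is hypothetical (not an existing schema.org type), the mlModel property in the second record is invented purely for illustration, and all names and URLs are made up:

```json
[
  {
    "@context": "https://schema.org/",
    "@type": "MLModel",
    "name": "ExampleNet",
    "description": "Layout 1: MLModel as a subtype of Dataset; concrete model files are DataDownloads.",
    "distribution": [
      {
        "@type": "DataDownload",
        "name": "ExampleNet, full size",
        "contentUrl": "https://example.org/examplenet/model.bin"
      },
      {
        "@type": "DataDownload",
        "name": "ExampleNet, smaller build for mobile on-device use",
        "contentUrl": "https://example.org/examplenet/model-mobile.bin"
      }
    ]
  },
  {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "ExampleNet (umbrella record)",
    "description": "Layout 2: MLModel in the MediaObject hierarchy, linked from a Dataset via an invented property.",
    "mlModel": {
      "@type": "MLModel",
      "name": "ExampleNet weights",
      "contentUrl": "https://example.org/examplenet/model.bin"
    }
  }
]
```

In layout 1, a single model record umbrellas its versions as DataDownloads; in layout 2, each version would be its own MLModel media object, and a linking property from Dataset would need to be agreed.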
A twist on this design:
- MLModel is just another subtype of CreativeWork.
- Similarly to the situation we have with things that are usefully described as a Book and also as a Product, anyone who is describing an MLModel that they think is usefully considered a "Dataset" can simply use multiple typing.
- While there is an expansive view of the notion of a "dataset" that encompasses ML models, pretty much anything digital can be handled as a dataset if the situation arises. This is comparable to "Product" being anything at all offered for sale, or art being anything you can get hung on a wall in an art gallery, etc.
It is important for schema changes to be driven by commitments from someone to build user-facing features that use the proposed vocabulary at scale; without this we tend to go in circles, lacking guidance. I have spoken with colleagues at Google who are responsible for Dataset Search, and they see some potential for this to be useful by enriching the information available about a dataset, for example, indicating which MLModels out there have been trained on that dataset.

For the case of supervised learning trained on a dataset, the path looks very straightforward. I would like to take a moment to look at situations beyond that, e.g. reinforcement learning, where the model is trained against runs of a simulation of some kind. For example, https://huggingface.co/format37/BipedalWalker-v3 uses OpenAI Gym in Python, which has dependencies on NumPy, Box2D, etc. How do MLModels indicate non-data environments critical to their creation? How much precision is appropriate here: Docker images, records of random seeds, etc.? Are we running into situations that would be better handled separately for all software packages / systems?
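For the RL case, one hedged option is to lean on existing software types for the environment. The sketch below links a hypothetical MLModel record to a SoftwareSourceCode record via the existing isBasedOn property; MLModel itself does not exist in schema.org, and this is only one possible way to encode the environment dependency:

```json
{
  "@context": "https://schema.org/",
  "@type": "MLModel",
  "name": "BipedalWalker-v3 policy (hypothetical markup)",
  "isBasedOn": {
    "@type": "SoftwareSourceCode",
    "name": "OpenAI Gym BipedalWalker environment",
    "codeRepository": "https://github.com/openai/gym",
    "programmingLanguage": "Python",
    "runtimePlatform": "Python 3"
  }
}
```

This still leaves open the deeper questions raised above (Docker images, random seeds), which this vocabulary does not attempt to capture.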
I'm still struggling to see MLModels as datasets, any more than an MS Word document is a Dataset. In my work with ML I've always seen these as an abstraction of a dataset (the set of data used to train the model). In this sense the training data and the MLModel are independent creative works. Can someone link me to an explanation of how the thing-ness of an MLModel qualifies as a dataset?
Respectfully,
- Hugh
ML has a lot of assets that are really interesting and potentially useful to have better defined in schema.org: certainly datasets, and the ability to set types around datasets in a better manner; the comments from Hugging Face above are in line with our thoughts. Vectors and weights are also highly useful. With models, model metadata could be very valuable, e.g. model purpose, known biases, links to training data, etc.
This issue is being nudged due to inactivity. |
We're working on https://github.com/enRichMyData/InnoGraph, a KG of the AI innovation ecosystem, and ML models are one of the kinds of entities that we are after. For me, there are a lot of things that can be tracked about ML, as explained by @julien-c and as exemplified e.g. at … Which of these attributes should we elaborate on?
@EmidioStani via Twitter:
In addition to the above-mentioned, I have some notes about these, in case we want to go even deeper:
And I have this "ML interop levels" diagram, not sure from where:
@EmidioStani @joaquinvanschoren
I don't see anything in common apart from the word "Model". Given that schema.org has refused to create class …
I like …
👍 I also don't like seeing …
Hmm, we're also missing a property for …
Some notes/thoughts:
Dataset is pretty vague: it can cover anything from .zip files of .wavs of social science interviews to application-specific on-disk file formats, etc. In theory we could have subtypes for all of these, but it would be a never-ending task, so we have tended to address such needs through indirect mechanisms like https://schema.org/fileFormat.
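As a small illustration of that indirection (all names and URLs here are invented), the existing fileFormat property can describe such a dataset without any new subtypes:

```json
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Social science interview recordings (invented example)",
  "distribution": {
    "@type": "DataDownload",
    "fileFormat": "application/zip",
    "contentUrl": "https://example.org/interviews.zip"
  }
}
```

(schema.org now marks fileFormat as superseded by encodingFormat, but the indirection works the same way with either property.)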
The practice of publishing and consuming ML models is different, and worth exploring a special subtype for, to allow consuming services to differentiate machine learning models from other unrelated things. This is partially to do with the massive attention associated with "ML", "AI" etc., but also relates to the relatively inscrutable nature of trained ML models. You get something that can map inputs to outputs or act as a controller of an agent of some kind, but how it works internally is relatively unclear. This is in stark contrast to the most stereotypical kinds of Datasetty Dataset, i.e. where you have the rows and columns of descriptive tables, and some reasonable clarity on how the parts of the dataset relate to the whole.
In many ways, machine learning models are more like software artifacts than they are like (for example) "open data" CSV files. Historically, Schema.org hasn't included https://schema.org/SoftwareSourceCode or https://schema.org/SoftwareApplication as subtypes of /Dataset, although a case could be made to do so. Similarly, Docker (etc.) packaging, or Debian/CPAN/NPM etc. packages, are somewhat related.
This issue comes out of discussions (with colleagues and in the wider world) about the scope of "Dataset", and the concern that it will just be super confusing to try to persuade ML folk to mark up models as datasets, whereas in their preferred terminology, models are built from datasets.
So -
How about we add (as an exploratory, "Pending" term) a type: MLModel
Adding an MLModel type would allow publishers to distinguish models from other datasets more clearly, and in a way that could be used by consuming applications. Recently there has been increasing recognition of the risks around bias and inappropriate re-use of pre-trained machine learning models, as well as a concern that the high cost of training certain kinds of model makes it important for such models to be shareable, lowering barriers to entry into the ML/AI field.

In this climate, it is ideal when data publishers provide not only the information describing a pre-trained machine learning model they're offering, but also the dataset(s) upon which the model was trained. Calling both of these published artifacts "Dataset" may be confusing, despite ML models fitting under Schema.org's pretty broad notion of Dataset. My proposal here partially resolves this by adding a convenience type. Those who care to dig might find that MLModel is considered a special kind of Dataset, but most won't even encounter this information.
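A minimal sketch of what a record using such a pending MLModel type might look like, linking back to its training data via the existing isBasedOn property (a stand-in, since no dedicated "trainedOn" property exists; names and URLs are invented):

```json
{
  "@context": "https://schema.org/",
  "@type": "MLModel",
  "name": "ExampleSentimentClassifier",
  "description": "Pre-trained sentiment model (hypothetical).",
  "isBasedOn": {
    "@type": "Dataset",
    "name": "Example Reviews Corpus",
    "url": "https://example.org/reviews-corpus"
  }
}
```

Under the subtype-of-Dataset proposal, consuming applications that don't know about MLModel could still process this record as an ordinary Dataset.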
There's more to say on this but I've written too much already, and wanted to start some discussion. Opinions welcomed!