add property oid #3436

Open
VladimirAlexiev opened this issue Jan 4, 2024 · 23 comments
Labels
no-issue-activity Discuss has gone quiet. Auto-tagging to encourage people to re-engage with the issue (or close it!).

Comments

@VladimirAlexiev

VladimirAlexiev commented Jan 4, 2024

@danbri and @alex-jansen:

#2915 added https://schema.org/iso6523Code (see that issue and https://en.wikipedia.org/wiki/ISO/IEC_6523 for a description).

However, ISO 6523 is just the 1.3 branch of the ITU/ISO/IEC object ID (oid) hierarchy. See https://en.wikipedia.org/wiki/Object_identifier and https://www.wikidata.org/wiki/Property:P3743.

OIDs can be browsed and resolved at http://oid-info.com/, eg http://oid-info.com/get/1.3.60 is DUNS.

An OID can also be used as a URN, eg urn:oid:1.3.60.
We could even use this as a property (not that I'd recommend it), and eg declare:

s:duns owl:equivalentProperty <urn:oid:1.3.60> .
s:leiCode owl:equivalentProperty <urn:oid:1.3.199> .
# CAGE is urn:oid:1.3.141

Now consider https://www.wikidata.org/wiki/Q7095072 Ontotext and its DUNS "053393007" and CAGE "6H8F4".
We could express them as the following alternatives (assuming that ":" is used as a separator):

<https://kg.ontotext.com/resource/agent/ontotext> a s:Organization;
  s:duns "053393007"; s:iso6523Code "60:053393007"; s:oid "1.3.60:053393007"; <urn:oid:1.3.60> "053393007";
  s:iso6523Code "141:6H8F4"; s:oid "1.3.141:6H8F4"; <urn:oid:1.3.141> "6H8F4".

Here's a proposed definition:

s:oid a rdf:Property;
  rdfs:label "object id";
  rdfs:comment """
ITU/ISO/IEC object identifier as per ITU X.660 standard.
Can be used for organizations, registered companies, locations (GS1 GLN), goods (GS1 GTIN), banks, ATMs, application layer protocols, IP address registrants, file & document formats, information objects, local/remote procedures, etc etc.
`oid` uses a dotted numeric notation that represents a tree of values, eg `1.3.6.1.4.1` represents IANA enterprise numbers, and `1.3.6.1.4.1.343` is Intel Corporation.
The oid `1.3` represents the ISO/IEC 6523 International Code Designator (ICD) that corresponds to the property `iso6523Code`.
ITU object IDs allow only numbers, but schema.org allows as an extension the last part to be separated by `:` and to consist of any characters.
For example, the NATO Commercial And Government Entity (NCAGE) code has oid `1.3.141`. Since "04KE8" is one of the NCAGE codes of Intel, we can use oid `1.3.141:04KE8` to represent Intel.
Standard numeric OIDs can be browsed and resolved at http://oid-info.com/, eg http://oid-info.com/get/1.3.141 is NCAGE.
OIDs can also be used as URNs, e.g. `urn:oid:1.3.141` can represent the NCAGE property, and `urn:oid:1.3.141:04KE8` can be used as a URN for Intel.
""";
  rdfs:seeAlso <https://en.wikipedia.org/wiki/Object_identifier>,
    <https://www.wikidata.org/wiki/Property:P3743>,
    <http://oid-info.com/>.
@danbri
Contributor

danbri commented Jan 4, 2024

What types would we attach this to?

Historically we've tried to minimize things at the Thing level, but it sounds like this would be at least Organization, Product, CreativeWork, Place, ... Is there anything that oid isn't used to identify?

How much "oid data" is out there?

@KalleOlaviNiemitalo

Can these instead be encoded using the existing identifier property and PropertyValue? There, propertyID could perhaps be urn:oid:1.3.60 and value could be 053393007. Or change identifier so that DefinedTerm can be used as an alternative to PropertyValue.
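
As a rough sketch of what this could look like in JSON-LD (built here with Python's standard `json` module): the helper name `oid_identifier` is hypothetical, and using a `urn:oid:` URN as `propertyID` is the suggestion under discussion, not established schema.org guidance. The DUNS and CAGE values are the Ontotext ones from the opening comment.

```python
import json


def oid_identifier(oid: str, value: str) -> dict:
    """Build a schema.org PropertyValue whose propertyID is a urn:oid URN.

    Hypothetical helper: the urn:oid propertyID convention is only a proposal.
    """
    return {
        "@type": "PropertyValue",
        "propertyID": f"urn:oid:{oid}",
        "value": value,
    }


org = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "@id": "https://kg.ontotext.com/resource/agent/ontotext",
    "identifier": [
        oid_identifier("1.3.60", "053393007"),  # DUNS
        oid_identifier("1.3.141", "6H8F4"),     # CAGE
    ],
}

print(json.dumps(org, indent=2))
```

The identifier value stays an opaque string, so the leading zero in the DUNS number and the letters in the CAGE code are preserved as-is.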

@KalleOlaviNiemitalo

I'm mainly worried about the oid:extension syntax:

  • Risk of conflict if a future successor of IETF RFC 3061 "A URN Namespace of Object Identifiers" defines a syntax involving a colon
  • Not clear who controls the format of extension -- the owner of the OID with which it is used, or always schema.org?

@thadguidry
Contributor

thadguidry commented Jan 5, 2024

For identifiers, we should always be worried about bit rot.
I think identifier with PropertyValue is likely a good candidate.
The worry then would be consistency, so that consuming apps can parse and understand the propertyID values. To avoid wild-west syntaxes, we might provide guidance here that promotes the standard URN conventions (urn:nid:... etc.; see https://en.wikipedia.org/wiki/Uniform_Resource_Name).

@thadguidry
Contributor

@KalleOlaviNiemitalo I wouldn't worry about any of that if we promoted the URN convention within propertyID values. Its syntax has a long history (since 1997) and is already supported in tons of open source. https://datatracker.ietf.org/doc/html/rfc2141

@VladimirAlexiev
Author

@KalleOlaviNiemitalo OIDs are a continuous tree where the levels can be anything.
Eg 1.3.6.1.4.1.343 (listed in my Turtle description) is Intel.
This resolves at http://oid-info.com/get/1.3.6.1.4.1.343: the site doesn't know that this is "Intel",
but it knows about the subdivisions that Intel uses: identifiers(1), products(2), experimental(3), information-technology(4), sysProducts(5), mib2ext(6), hw(7), wekiva(111)
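
To make the "continuous tree" point concrete, here is a minimal Python sketch that treats an OID as a tuple of integer arcs and tests ancestry by comparing arcs rather than strings. The function names `parse_oid` and `is_under` are made up for illustration.

```python
from typing import Tuple


def parse_oid(oid: str) -> Tuple[int, ...]:
    """Split a dotted OID such as '1.3.6.1.4.1.343' into integer arcs."""
    arcs = oid.split(".")
    if not all(arc.isascii() and arc.isdigit() for arc in arcs):
        raise ValueError(f"not a dotted numeric OID: {oid!r}")
    return tuple(int(arc) for arc in arcs)


def is_under(oid: str, ancestor: str) -> bool:
    """True if `oid` lies strictly below `ancestor` in the OID tree."""
    child, parent = parse_oid(oid), parse_oid(ancestor)
    return len(child) > len(parent) and child[: len(parent)] == parent


print(is_under("1.3.6.1.4.1.343", "1.3.6.1.4.1"))  # Intel under IANA enterprise numbers
print(is_under("1.3.60", "1.3"))                   # DUNS under the ISO 6523 branch
print(is_under("1.30", "1.3"))                     # a plain string-prefix test would get this wrong
```

Comparing arcs instead of string prefixes matters because `1.30` starts with the characters `1.3` but is a different branch.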

BTW, there's an interesting story behind 1.3.6.1.4.1, which is iso(1), identified-organization(3), dod(6), internet(1), private(4), enterprise(1):

  • IANA misappropriated this for their company registrant namespace
  • There is another namespace 1.3.IANA that they are supposed to use
  • But I guess for backward compatibility reasons they stayed under dod ;-)

So in the global OID tree, it is unclear where to place the boundary between:

  • metadata (properties, identifiers kinds) and
  • data (identifiers of companies, IT devices, whatever)

http://oid-info.com/get/1.3.60.053393007 doesn't resolve because:

  • Dun & Bradstreet will probably never publish their humongous company database as an OID tree
  • It probably isn't feasible to publish a humongous database within that tree
  • The leading zero in 053393007 may make it a syntactically invalid element;
    and certainly 04KE8 is an invalid element because of the letters, so 1.3.141.04KE8 would be an invalid oid

I am a bit uneasy with my proposal to use an extension syntax like 1.3.141:04KE8: that's a hack.


I know about the split identifier [a PropertyValue; propertyID "foo"; value "bar"].


I don't think there's any conflict with RFC 3061 because urn:oid:1.3.141:04KE8 is a fully valid URN.

Truth be told, because of this RFC there's very little loss if this proposal is rejected: I can just use

<https://kg.ontotext.com/resource/agent/ontotext>
   s:identifier <urn:oid:1.3.141:04KE8>;
  # or even
   s:sameAs <urn:oid:1.3.141:04KE8>.

My main motivation is that I researched https://en.wikipedia.org/wiki/ISO/IEC_6523 (a collection of identifier schemes)
and then I figured out that OIDs are a superset of this:
it's an infinitely extensible namespace with delegatable subspaces, before internet domains were invented.
So what I'd really like is for schema.org to document this way of using globally identified properties:
if it goes in the documentation of identifier, I can give up this oid proposal.

@KalleOlaviNiemitalo

urn:oid:1.3.141:04KE8 is a fully valid URN.

It matches the namestring syntax in IETF RFC 8141, but its NSS portion 1.3.141:04KE8 violates IETF RFC 3061 and thus is not valid to use with the NID oid:

The NSS portion of the name is strictly limited to the digits 0-9 and the '.' character with no leading zeros.

If this remains invalid as a URN forever, and schema.org defines its own interpretation of the string "urn:oid:1.3.141:04KE8", then there is no conflict. But if IETF RFC 3061 is ever updated to make this syntax valid as a URN, with semantics different from what schema.org assigned, then that will be a conflict.
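
The RFC 3061 restriction quoted above (digits and `.` only, no leading zeros) can be sketched as a regular expression in Python. This is an illustrative check, not a full URN parser; the name `is_rfc3061_urn` is hypothetical.

```python
import re

# One OID arc: either "0" or a number without leading zeros.
_ARC = r"(?:0|[1-9][0-9]*)"

# "urn:oid:" followed by dot-separated arcs; the "urn:oid:" prefix is matched case-insensitively.
_URN_OID = re.compile(rf"urn:oid:{_ARC}(?:\.{_ARC})*", re.IGNORECASE)


def is_rfc3061_urn(urn: str) -> bool:
    """True if `urn` fits the digits-and-dots NSS rule quoted from RFC 3061."""
    return _URN_OID.fullmatch(urn) is not None


for candidate in (
    "urn:oid:1.3.6.1.4.1.343",    # plain numeric OID
    "urn:oid:1.3.141:04KE8",      # proposed ':' extension
    "urn:oid:1.3.60.053393007",   # arc with a leading zero
):
    print(candidate, is_rfc3061_urn(candidate))
```

Under this reading, both the `:` extension and a DUNS number appended as an arc with a leading zero fall outside the RFC 3061 syntax.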

@danbri
Contributor

danbri commented Jan 7, 2024

if it goes in the documentation of identifier, I can give up this oid proposal.

Let's do that. Can you make a PR?

@VladimirAlexiev
Author

@KalleOlaviNiemitalo OID syntax is strictly dotted integers (\d+(\.\d+)*) and it has been like this for maybe 20 years. Therefore there is no possibility that RFC 3061 or its successor may appropriate : for use.
The valid concern is: are we ok to willingly violate RFC 3061 by tacking any namespace-specific identifier (which may be non-numeric) after : ?

I personally am ok with this because it seems useful to be able to mint URNs from any identifier:

  • urn:oid:1.3.6.1.4.1.343: Intel as IANA registrant in the pure numeric OID tree
  • urn:oid:1.3.60:047897855: Intel as DUNS. It's not valid to use . because OID numbers cannot start with zero, and I don't think the keepers of the OID register will be willing to record hundreds of millions of DUNS identifiers.
  • urn:oid:1.3.141:04KE8: Intel represented by one of its NCAGE codes. NATO and DoD will never rework NCAGE codes to be numeric, and I don't think the keepers of the OID register will be willing to record several million NATO supplier records.
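
For what it's worth, a consumer could split such extended strings with very little code. The sketch below assumes the `:` extension convention floated in this thread, which is not part of RFC 3061; the function name `split_extended_oid` is hypothetical.

```python
from typing import Optional, Tuple


def split_extended_oid(urn: str) -> Tuple[str, Optional[str]]:
    """Split 'urn:oid:<oid>[:<local-id>]' into (oid, local_id or None).

    Assumption: the part after the first ':' following the OID is an opaque,
    scheme-specific identifier, as proposed in this thread.
    """
    prefix = "urn:oid:"
    if not urn.lower().startswith(prefix):
        raise ValueError(f"not a urn:oid string: {urn!r}")
    oid, sep, local_id = urn[len(prefix):].partition(":")
    return oid, (local_id if sep else None)


print(split_extended_oid("urn:oid:1.3.6.1.4.1.343"))   # numeric OID only
print(split_extended_oid("urn:oid:1.3.60:047897855"))  # DUNS scheme plus DUNS number
print(split_extended_oid("urn:oid:1.3.141:04KE8"))     # NCAGE scheme plus NCAGE code
```

The local identifier is kept as a string, so leading zeros and letters survive the split.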

@danbri I'll make a PR, but first tell me what you think of the above.
And please answer my questions in #2915 (comment) .

@MatthiasWiesmann
Contributor

I would distinguish between two uses of ISO 6523.

  • As a root for the OID tree.
  • As a meta system for organization identifiers.

As mentioned above, the DUNS tree (0060) will probably never be exposed, and conversely I don't expect anyone to consume organization identifiers from the CERN root (0020). Also note that there are other identifier schemes which have an OID that are not purely numerical identifiers:

  • IBANs (icd: 0021).
  • Swiss UIDs (icd: 0183).

If you look at the list of ICDs, the registrars typically make it explicit when they want to use them for ISO 8348.

From my position, the iso6523Code field should be restricted to Organization (and maybe Person). I would really avoid using identifier, because its usage is really vague and abstract for most users.

@Tiggerito

The reference list of ICDs is missing the 9XXX codes which contain a lot of VAT number types.

I read up a bit more and found the 9XXXs in the EAS list I found are not in the ICD list. So I guess they are not ICD codes. Wouldn't it be of value to be able to add things like VAT numbers for organization identifiers?

https://ec.europa.eu/digital-building-blocks/sites/display/DIGITAL/Code+lists
Most other identifiers in the eInvoicing standard can also be based on different identifiers schemes according to the International Code Designator (ICD) code list. When the same identifier scheme is allowed in both the ICD and the EAS code list, it shall use the same code value. Consequently, such codes must first be registered in the ICD code list and then requested to be added to the EAS code list. The EAS code list may also include other identifiers schemes that are specific for electronic addressing, such as emails, Uniform Resource Locator (URL) and other Uniform Resource Identifiers (URI). For details on this procedure of requesting changes to the EAS code list, please contact the DIGITAL Service Desk.

@thadguidry
Contributor

@Tiggerito vatID is already a property on Organization if that helps

vatID The Value-added Tax ID of the organization or person.

@Tiggerito

@thadguidry Good point. I guess that combined with the organization's address/country would be enough.

@MatthiasWiesmann
Contributor

@thadguidry

You assume two things:

  • That a country only has one Tax-ID number format.
  • That a company's tax / vat identifiers are from the country their address is in. For instance companies in Monaco can have a French SIREN number.

Generally, taxID and vatID are fields which are difficult to parse, because the syntax is not specified (do you use the common formatting for the country?), you need the value from another field (country) to even know what it is, and even with the country value, parsing still has to handle ambiguities. ISO-6523 is much more constrained, and therefore more robust for parsing.

@Tiggerito

The 9XXX are not official ICDs, these are part of the PEPPOL extension, which is a de facto standard.

@thadguidry
Contributor

thadguidry commented Jan 10, 2024

@MatthiasWiesmann I didn't assume any of those. I simply stated that Schema.org provides a property to hold the values. How to format the values and attach additional metadata to them can certainly be handled by coordinating the vatID property with https://schema.org/PropertyValueSpecification, could it not? Do you see gaps here if the vatID context is that of PropertyValueSpecification, or even when using multi-typing to provide some external additional context?

FYI, Schema.org typically doesn't get into the formatting weeds of values (since there's often not a need when we also have PropertyValueSpecification) unless absolutely necessary to make publisher/consumer lives easier.

If you need help with PropertyValueSpecification, we can help, and move that discussion to our mailing list, or just directly in our GitHub Discussions button above.

@MatthiasWiesmann
Contributor

MatthiasWiesmann commented Jan 10, 2024

My main point is that ISO 6523 solves two problems: type identification and formatting.

If I understand PropertyValueSpecification correctly, we would need one or more per country, and it would not help with the identification issue: vatID and taxID are quite ambiguous and don't cover the space well, and DUNS codes are neither.

@thadguidry
Contributor

thadguidry commented Jan 10, 2024

@MatthiasWiesmann Oops! So sorry, that should have been PropertyValue https://schema.org/PropertyValue . But the PropertyValueSpecification comes from Hydra and it's a way to specify a format (using its valuePattern). If using PropertyValue, to give more detail about the multiple vatIDs that an Organization might have, you could use PropertyValue or even provide a more specific StructuredValue using its disambiguatingDescription "registered in Ireland" as well as identifier, url, etc.

Wouldn't that be enough to know that a vatID value might follow a particular pattern? My understanding has always been that once you parse the first two letters (the country code) of the vatID value (or of multiple vatIDs if provided by a publisher), you can parse the rest of the value and know its format more easily, no? https://en.wikipedia.org/wiki/VAT_identification_number

or just give context (the ICD part) that it's a https://schema.org/iso6523Code directly on the vatID property.
Sorry, but I just don't understand the format confusion here and why it would be hard to parse, given that you can provide a lot of metadata via all those properties I mentioned, to say what kind of value, who is the authority, where is the record for this vatID registration with that authority (a url), the date of registration, etc. etc. Can you help me understand more why StructuredValue, or iso6523Code property values/types could not help with your format parsing concerns?

@danbri This is likely where we need to provide better docs and guidance on how best to use those. Hmm, and seems we missed adding vatID as a subproperty of https://schema.org/identifier ? Hmm, and there's no back link reference on https://schema.org/vatID to know that folks can set the ICD portion from iso6523 which we do mention on https://schema.org/iso6523Code

@MatthiasWiesmann
Contributor

MatthiasWiesmann commented Jan 10, 2024

Sadly, things are complicated.

  • Australia has VAT, Australian VAT numbers are not prefixed with AU.
  • European VAT numbers are prefixed with the country code if they are intra-community codes (i.e. used EU-wide); they technically don't need to be when used only within one country (although the trend is to always put the prefix).
  • Greece is part of the EU, the country code is GR, but VAT numbers are prefixed with EL.
  • Northern Ireland is not a country, so its VAT numbers are prefixed with an ISO extension (XI).
  • There are magic EU wide VAT numbers prefixed with EU.
  • Swiss VAT numbers are prefixed with CHE, which is the ISO 3-letter code.
  • Many of these numbers have two formats, a machine readable version (no space, dots, dashes) and one with a defined formatting. Is a number with the wrong formatting (dots in the wrong place) correct or not?
  • The term taxID is very vague: most countries have a number given out by the government, which might be used in some administrative function, taxation being one of them. Add to the confusion that VAT is taxation (the T is a hint).
  • There are often multiple identifiers at play, for instance in France, the following numbers are relevant:
    • The SIREN code identifies the company.
    • The SIRET code identifies a branch (or the main seat).
    • The RCS code is the SIREN code formatted differently with the addition of the city the company is registered in.
    • The APE code (also called NAF code) determines the activity sector of the company (which has tax implications).
    • The VAT code, which is derived from the SIREN code, but with the FR prefix and an additional checksum.
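
As an illustration of the last point, the French VAT number can be derived from the SIREN with a small computation. The sketch below uses the commonly documented key formula, (12 + 3 × (SIREN mod 97)) mod 97; both the formula as written here and the sample SIREN values are assumptions for illustration, so check the official rules before relying on it. The function name `french_vat_from_siren` is hypothetical.

```python
def french_vat_from_siren(siren: str) -> str:
    """Derive a French intra-community VAT number from a 9-digit SIREN.

    Assumed formula: key = (12 + 3 * (SIREN mod 97)) mod 97, written as two digits.
    """
    digits = siren.replace(" ", "")
    if len(digits) != 9 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"SIREN must be 9 digits: {siren!r}")
    key = (12 + 3 * (int(digits) % 97)) % 97
    return f"FR{key:02d}{digits}"


print(french_vat_from_siren("404 833 048"))  # FR83404833048
print(french_vat_from_siren("123456789"))    # FR32123456789
```

So a parser that only sees the VAT number can recover the SIREN by dropping the prefix and key, but the reverse direction needs the checksum rule.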

The core problem is that users will just put whatever makes sense for them into the vatID or taxID field. They could structure the values as PropertyValues, but they probably won't: it's complicated and the set of propertyID values is not defined. Finding the list of ICDs is not trivial, and a list of propertyID values for vatIDs and taxIDs does not exist.

So a parser will have to assume that taxID and vatID are basically synonyms and represent a badly defined variant of the iso6523 field: there are keys and values, except the set of keys is not really defined and neither is the format of the values. Proper validation is only possible by cross-referencing other fields (address, for the country), with magic fallbacks if this is missing. In turn this means online validation will be difficult, so reporting to users that their data does not parse is harder.

Basically, which one do you think I would rather parse and validate?

法人番号: 2220005004311
<meta itemprop="iso6523Code" content="0188:2220005004311" />

or

  <div itemprop="taxID" itemscope itemtype="https://schema.org/PropertyValue">
      <span itemprop="propertyID">法人番号</span>: 
      <span itemprop="value">2220005004311</span>
  </div>

Yes, you could do

  <div itemprop="taxID" itemscope itemtype="https://schema.org/PropertyValue">
      <meta itemprop="propertyID" content="icd:0188" />法人番号: 
      <span itemprop="value">2220005004311</span>
 </div>

Or

<div itemprop="iso6523Code" itemscope itemtype="https://schema.org/PropertyValue">
<meta itemprop="propertyID" content="icd:0188" />法人番号: 
<span itemprop="value">2220005004311</span>
</div>

But this is more difficult to add to a web page and more work to parse…
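
For comparison, consuming the single-string form takes only a few lines. This Python sketch assumes the `ICD:value` layout used in the examples above, with a numeric ICD of up to four digits that is zero-padded for comparison; the names `Iso6523Code` and `parse_iso6523_code` are made up for illustration.

```python
from typing import NamedTuple


class Iso6523Code(NamedTuple):
    icd: str    # International Code Designator, normalised to 4 digits
    value: str  # organization identifier within that scheme, kept verbatim


def parse_iso6523_code(code: str) -> Iso6523Code:
    """Split an 'ICD:value' string at the first colon."""
    icd, sep, value = code.partition(":")
    if not sep or not value:
        raise ValueError(f"expected 'ICD:value', got {code!r}")
    if not (icd.isascii() and icd.isdigit()) or len(icd) > 4:
        raise ValueError(f"ICD must be 1 to 4 digits: {icd!r}")
    return Iso6523Code(icd.zfill(4), value)


print(parse_iso6523_code("0188:2220005004311"))
print(parse_iso6523_code("60:053393007"))
```

The value part is never converted to a number, so identifiers with leading zeros or letters pass through unchanged.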

@thadguidry
Contributor

thadguidry commented Jan 10, 2024

I much prefer property/value pairs in JSON-LD, and how I usually handle it. I have always found property/value pairs (semantics, which allows easier information exchange) to be easier to parse than values with separators that lack extra information. In fact, the term "parsing" indeed comes into play when separator characters are used to delineate information in a multi-value value. For JSON-LD, as a consumer of the data, the libraries handle the parsing for you, so all you have to do is make sense of values and perhaps custom extensions and RDFa nodes. But that's me.

@Tiggerito

Australia is an interesting example. Let's see if I remember correctly.

We have two business identifiers which are used to pay tax:

ACN: Australian Company Number
ABN: Australian Business Number

Both are numbers where the ACN is the ABN without the first few numbers. Only businesses registered as a company have an ACN.

We also have TFN (Tax File Number), which businesses and individuals have.

In Australia we pay GST not VAT. For businesses you track GST paid/charged via the ABN.

ICD has an entry for the ABN (0151) but not the other two. So we can identify a business in Australia via the ABN.

With the https://schema.org/PropertyValue idea, I guess we might be able to use the EAS codes.

<div itemprop="vatID" itemscope itemtype="https://schema.org/PropertyValue">
      <meta itemprop="propertyID" content="peppol:9932" />United Kingdom VAT number
      <span itemprop="value">GB1234567890</span>
 </div>

I found what looks like a better EAS list that indicates what the source is:

https://ec.europa.eu/digital-building-blocks/wikis/download/attachments/467108974/Electronic%20Address%20Scheme%20Code%20list%20-%20version%209%20-%20published%20March2022.xlsx?version=1&modificationDate=1646394201721&api=v2

@Tiggerito

That "better EAS list" is incomplete :-(

@MatthiasWiesmann
Contributor

I much prefer property/value pairs in JSON-LD, and how I usually handle it. I have always found property/value pairs (semantics, which allows easier information exchange) to be easier to parse than values with separators that lack extra information. In fact, the term "parsing" indeed comes into play when separator characters are used to delineate information in a multi-value value. For JSON-LD, as a consumer of the data, the libraries handle the parsing for you, so all you have to do is make sense of values and perhaps custom extensions and RDFa nodes. But that's me.

The data comes from systems which are probably not JSON-LD, and will be parsed into structures which are not JSON-LD.

Breaking up the information in transit does not bring much, and adds risk of breakage. The whole point of standards like ISO 6523 is that values can be transported without any transformation in the same way as country codes (ISO 3166), language codes (ISO 639), date-times (ISO 8601).


This issue is being nudged due to inactivity.

@github-actions github-actions bot added the no-issue-activity Discuss has gone quiet. Auto-tagging to encourage people to re-engage with the issue (or close it!). label Apr 10, 2024