Edurep:Metadata verwerking/en

From Kennisnet Developers Documentation

Current version as of 13 Nov 2023, 22:32


Edurep harvests various repositories via the OAI-PMH protocol. End users can then search the collected material through the search index. To guarantee quality and uniformity, a range of validations, edits and translations is applied to the metadata. The diagram below is a simplified representation of the processes that take place within Edurep.

  1. records come in through a specific interface
  2. in addition to an initial validation (XML/access), content validation takes place
  3. various retrievable representations are created
  4. the Schema.org variant is used in the aggregator and search index

[Figure: EdurepComponentenSimplified.png]

OAI-PMH Harvester

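The harvester retrieves the records from the provider repository and thus acts as the gateway to Edurep. Records that do not validate are rejected; the status can be viewed on the harvester status page:

  • Production: https://harvester.edurep.kennisnet.nl/showHarvesterStatus?domainId=prod10
  • Staging: https://harvester.edurep.kennisnet.nl/showHarvesterStatus?domainId=staging10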
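
As an illustration of what a harvest request looks like, the sketch below builds a standard OAI-PMH ListRecords URL. The verb and parameter names come from the OAI-PMH protocol; the base URL and the metadata prefix "lom" are assumptions chosen for the example and do not describe a real provider or the actual Edurep configuration.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Selective harvesting: only records changed since this datestamp.
        params["from"] = from_date
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical provider endpoint and metadata prefix.
url = list_records_url(
    "https://repository.example.org/oai", "lom", from_date="2023-11-01"
)
print(url)
```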

Harvester Status page

The meaning of each column in the status overview is explained below:

  • Repository: The identifier of the repository (or a link to the repository data, only available to the Edurep administrator).
  • Last successful harvest: Timestamp of the last time the harvester completed a harvest without errors.
  • Total records: Total number of records retrieved by the harvester. (This number is not necessarily the same as the number in Edurep. For example, our deadlink checker may clean up records with dead links.)
  • Harvested/Uploaded/Deleted: The ratio between the number of new or changed records and the number of deleted records in the last harvest visit.
  • #Validation Errors: The number of validation errors. The link points to a list of all errors at the bottom of the status page.
  • #Errors: The number of errors. The link points to a list of all errors at the bottom of the status page.
  • RSS: Provides access to an RSS feed for a specific connected repository.

Validation errors

When a record is successfully harvested, it can still be rejected for inclusion in the search engine based on a validation error. Currently, validation is only performed against the LOM XML schema (both IEEE and IMS bindings can be offered).

Unlike a regular error, a validation error does not stop the harvest immediately. In principle, harvesting stops after 100 validation errors, at which point the harvester reports an error.
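
The behaviour described above can be sketched as follows. This is a minimal illustration, not the harvester's actual code: the names harvest, HarvestAborted and is_valid are hypothetical, and it assumes that the harvest stops at the moment the 100th validation error is counted.

```python
MAX_VALIDATION_ERRORS = 100  # threshold mentioned in the text


class HarvestAborted(Exception):
    """Raised when too many records fail validation (hypothetical name)."""


def harvest(records, is_valid, max_errors=MAX_VALIDATION_ERRORS):
    """Accept valid records, skip invalid ones, abort at the error limit."""
    accepted, rejected = [], []
    for record in records:
        if is_valid(record):
            accepted.append(record)
            continue
        # A validation error rejects only this record; harvesting goes on.
        rejected.append(record)
        if len(rejected) >= max_errors:
            # Assumption: reaching the limit turns into a harvester error.
            raise HarvestAborted(f"stopped after {len(rejected)} validation errors")
    return accepted, rejected
```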

Each validation error can be viewed individually from the error overview. The first line of such an error message contains a generated summary of the error. In a number of cases this message is sufficient to locate the error in question. The XML is shown in the IEEE LOM binding, but its content is identical to that of the submitted record.

Sometimes, however, this line only says "Line 105: Unable to transform record". The error itself can then be found in the XML, where it is wrapped in explicit Edurep error elements:

105 <edurep:error xmlns:edurep="http://meresco.org/namespace/users/kennisnet/edurep">
106   <lom:keyword xmlns:lom="http://www.imsglobal.org/xsd/imsmd_v1p2">
107     <lom:langstring xml:lang="nl"/>
108   </lom:keyword>
109 </edurep:error>
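
The error elements shown above can also be located programmatically. The sketch below uses Python's standard ElementTree module and the namespace from the example; the surrounding record element and the function name find_error_fragments are assumptions made to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the error element in the example.
EDUREP_NS = "http://meresco.org/namespace/users/kennisnet/edurep"

# Hypothetical record: only the edurep:error fragment comes from the example.
SAMPLE = """<record>
<edurep:error xmlns:edurep="http://meresco.org/namespace/users/kennisnet/edurep">
 <lom:keyword xmlns:lom="http://www.imsglobal.org/xsd/imsmd_v1p2">
  <lom:langstring xml:lang="nl"/>
 </lom:keyword>
</edurep:error>
</record>"""


def find_error_fragments(xml_text):
    """Return, per edurep:error element, the local names of its child elements."""
    root = ET.fromstring(xml_text)
    fragments = []
    for error in root.iter(f"{{{EDUREP_NS}}}error"):
        # ElementTree tags look like "{namespace}local"; keep the local part.
        fragments.append([child.tag.split("}")[-1] for child in error])
    return fragments


print(find_error_fragments(SAMPLE))
```

Here the offending element is a keyword whose langstring is empty.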

Once the provider has fixed the record and offers it again with a new, updated timestamp in accordance with OAI-PMH, the record is harvested normally again and the validation error disappears.

Deadlink Checker

The Deadlink Checker checks whether a record contains a valid and working URL in the URL field. A record can be given one of the following statuses:

  • OK: The URL returns a 2xx or 3xx HTTP status code
  • NTL: The record does not contain a URL (No Technical Location)
  • FAILED: The URL is not valid, or it leads to a timeout or a 4xx or 5xx HTTP status code

Records with the status FAILED are not shown in Edurep search results.

About once a week, all records with the status "OK" are checked. The records with the status "FAILED" are checked every day.
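
The status rules above can be summarised in a small function. This is a sketch under stated assumptions, not the actual Deadlink Checker: the function names are hypothetical, a URL counts as valid here when it has an http or https scheme and a host, and any response outside the 2xx and 3xx ranges is treated as FAILED.

```python
from urllib.parse import urlparse


def is_valid_url(url):
    """Assumed validity rule: http(s) scheme and a host name."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def link_status(url, http_status=None, timed_out=False):
    """Map the outcome of a link check to OK, NTL or FAILED."""
    if not url:
        return "NTL"  # No Technical Location: the record has no URL
    if not is_valid_url(url) or timed_out or http_status is None:
        return "FAILED"
    if 200 <= http_status < 400:
        return "OK"  # 2xx and 3xx responses
    return "FAILED"  # 4xx, 5xx and anything else


print(link_status("https://example.org/lesson", 200))
print(link_status("https://example.org/gone", 404))
print(link_status(""))
```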

Note: Some learning materials with a dead link may still go undetected, because the URL ends up on a so-called landing page that reports OK instead of FAILED. These situations are difficult to recognize.

Deadlink Checker Status Page

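There is a status page where you can request an overview of the dead links per repository:

  • Production: https://wszoeken.edurep.kennisnet.nl/status
  • Staging: https://staging.edurep.kennisnet.nl/status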

A short description for each column in the status overview:

  • Repository: The repository identifier as it is known in Edurep.
  • Vindbare records (findable records): The number of harvested records minus the records with dead links.
  • Deadlink records: The number of records with dead links.
  • Totaal (total): The number of harvested records.

When you click on one of the repositories, a separate page opens where you can subscribe to the RSS feeds of the harvester and the deadlink checker, and where you can download an overview of the record identifiers of all dead links.

Edits and Validation

There are various editing and validation processes in Edurep to increase the quality of records.

missing values

To improve the quality of the metadata, Edurep fills in a number of metadata fields if they have not been filled in by the provider. This concerns:

  • costs: Edurep enters costs=yes if no costs have been entered.
  • publisher: Edurep fills in the repository_id as publisher when the provider does not provide a publisher.
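
The two defaults above can be sketched as follows. A record is represented here as a plain dictionary, which is an assumption for illustration; the keys "costs" and "publisher" and the function name are hypothetical and do not reflect Edurep's internal data model.

```python
def fill_missing_values(record, repository_id):
    """Return a copy of the record with the documented defaults applied."""
    completed = dict(record)
    if not completed.get("costs"):
        # No costs supplied by the provider: default to costs=yes.
        completed["costs"] = "yes"
    if not completed.get("publisher"):
        # No publisher supplied: fall back to the repository identifier.
        completed["publisher"] = repository_id
    return completed


print(fill_missing_values({"title": "Example lesson"}, "example_repository"))
```

Values that the provider did supply are left untouched.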

vocabulary values

Various vocabulary fields are validated in Edurep.

legacy:
The legacy solution only works with XSLTs (https://github.com/kennisnet/edurep-xslt). In these, some old values are mapped to new values, and incorrect values are removed or adjusted. This solution will be phased out after the 2021-11 release.

new:
The new solution makes it easier for us to apply more improvements, and it also makes it possible to find out more explicitly what has not been validated.

classifications

All NL LOM classification fields are converted to schema.org according to fixed rules. The goal is to reserve the schema.org fields for subject, educational level and competence for terms from general curriculum vocabularies. Generally speaking, we apply the following rules in order:

  1. an OBK identifier (http://purl.edustandaard.nl/begrippenkader/*) within purpose discipline, educational level or competency is placed in schema:educationalAlignment, schema:educationalLevel and schema:teaches, respectively
  2. for specific old VDEX vocabulary values, a translation is made from the old value to an OBK identifier
  3. specific rules apply to other classification values, including the mapping of access rights
  4. anything that does not match is stored as a keyword, while retaining the vocabulary information
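
The order of these rules can be illustrated with a simplified sketch. The purpose names and schema.org fields come from the list above; everything else is an assumption: the function name, the single entry in the legacy table (its key is invented), the "keyword" target name, and the omission of rule 3, whose specifics are not described here.

```python
OBK_PREFIX = "http://purl.edustandaard.nl/begrippenkader/"

# Rule 1: purpose type -> schema.org field (taken from the list above).
PURPOSE_TO_FIELD = {
    "discipline": "schema:educationalAlignment",
    "educational level": "schema:educationalLevel",
    "competency": "schema:teaches",
}

# Rule 2: hypothetical legacy value; the real mapping table is much larger.
LEGACY_VDEX_TO_OBK = {
    "old-vdex-vo": OBK_PREFIX + "2a1401e9-c223-493b-9b86-78f6993b1a8d",
}


def convert_classification(purpose, value):
    """Return (target field, value) for one classification entry."""
    field = PURPOSE_TO_FIELD.get(purpose)
    if field and value.startswith(OBK_PREFIX):
        return field, value  # rule 1: already an OBK identifier
    if field and value in LEGACY_VDEX_TO_OBK:
        return field, LEGACY_VDEX_TO_OBK[value]  # rule 2: translate old value
    # Rule 3 (specific rules, such as access rights) is not modelled here.
    return "keyword", value  # rule 4: fall back to a keyword


print(convert_classification("discipline", OBK_PREFIX + "example-id"))
print(convert_classification("educational level", "old-vdex-vo"))
print(convert_classification("idea", "anything else"))
```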

A complete overview of the conversion can be found on the 2021 migration page.

OBK

Filling in a label for a classification identifier is not mandatory, but labels allow search portals to display meaningful names in search results without setting up their own lookup service. Edurep therefore always fills in the label for each valid purl.edustandaard.nl/begrippenkader classification identifier of learning level, subject or goal.

  • Any existing label will be overwritten.
  • If a taxon does not contain an ID, but only an entry, the taxon is removed from the record because the validity of the entry cannot be determined.

Example input:

<taxonpath>
  <source>
    <langstring xml:lang="x-none">http://purl.edustandaard.nl/begrippenkader</langstring>
  </source>
  <taxon>
    <!-- OBK-id for Secondary Education -->
    <id>2a1401e9-c223-493b-9b86-78f6993b1a8d</id>
  </taxon>
  <taxon>
    <id>512e4729-03a4-43a2-95ba-758071d1b725</id>
    <entry>
      <langstring xml:lang="nl">PO</langstring>
    </entry>
  </taxon>
</taxonpath>

Result:

<taxonpath>
  <source>
    <langstring xml:lang="x-none">http://purl.edustandaard.nl/begrippenkader</langstring>
  </source>
  <taxon>
    <id>2a1401e9-c223-493b-9b86-78f6993b1a8d</id>
    <!-- The entry is automatically completed -->
    <entry>
      <langstring xml:lang="nl">Voortgezet Onderwijs</langstring>
    </entry>
  </taxon>
  <taxon>
    <id>512e4729-03a4-43a2-95ba-758071d1b725</id>
    <!-- The entry has been automatically overwritten -->
    <entry>
      <langstring xml:lang="nl">Primair Onderwijs</langstring>
    </entry>
  </taxon>
</taxonpath>
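
The transformation from input to result can be approximated with the sketch below. It is an illustration only: the LABELS dictionary stands in for a lookup against the OBK (here filled with the two labels from the example), the function name is hypothetical, and taxons with an unknown identifier are simply left untouched, which the text does not specify.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Stand-in for an OBK lookup; the two labels are taken from the example.
LABELS = {
    "2a1401e9-c223-493b-9b86-78f6993b1a8d": "Voortgezet Onderwijs",
    "512e4729-03a4-43a2-95ba-758071d1b725": "Primair Onderwijs",
}

SAMPLE = """<taxonpath>
  <taxon><id>2a1401e9-c223-493b-9b86-78f6993b1a8d</id></taxon>
  <taxon><id>512e4729-03a4-43a2-95ba-758071d1b725</id>
    <entry><langstring xml:lang="nl">PO</langstring></entry></taxon>
  <taxon><entry><langstring xml:lang="nl">only an entry</langstring></entry></taxon>
</taxonpath>"""


def fill_labels(taxonpath, labels=LABELS, lang="nl"):
    """Overwrite or add labels; drop taxons that have no identifier."""
    for taxon in list(taxonpath.findall("taxon")):
        taxon_id = taxon.findtext("id")
        if not taxon_id:
            taxonpath.remove(taxon)  # entry without id cannot be verified
            continue
        label = labels.get(taxon_id.strip())
        if label is None:
            continue  # assumption: unknown identifiers are left as they are
        for entry in taxon.findall("entry"):
            taxon.remove(entry)  # any existing label is overwritten
        langstring = ET.SubElement(ET.SubElement(taxon, "entry"), "langstring")
        langstring.set(XML_LANG, lang)
        langstring.text = label
    return taxonpath


result = fill_labels(ET.fromstring(SAMPLE))
print([t.findtext("entry/langstring") for t in result.findall("taxon")])
```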

vCard

Edurep parses the vCard in a centity element to enable searches by "author" or "publisher". Of all the properties that a vCard can contain, only N, FN and ORG are considered as the name of an author or publisher. The value of the first of those three that is filled in is used. The vCard must be compatible with version 3.0, in which FN, N and VERSION are mandatory properties and therefore cannot be omitted.
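
A rough sketch of this selection is given below. It is not Edurep's parser: the function name is hypothetical, line folding and escaping are ignored, the order N, FN, ORG is assumed from the order in which the properties are listed, and the structured N value is returned as is.

```python
def vcard_name(vcard_text):
    """Return the first filled-in value of N, FN or ORG, or None."""
    properties = {}
    for line in vcard_text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        # Drop parameters such as ";CHARSET=UTF-8" from the property name.
        name = key.split(";", 1)[0].strip().upper()
        properties.setdefault(name, value.strip())
    for name in ("N", "FN", "ORG"):  # assumed order of preference
        value = properties.get(name, "")
        # A value consisting only of separators (";;;;") counts as empty.
        if value.strip("; "):
            return value
    return None


CARD = "BEGIN:VCARD\nVERSION:3.0\nFN:Jan Jansen\nN:Jansen;Jan;;;\nEND:VCARD"
print(vcard_name(CARD))
```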

Connections

During processing, connections are established between records so that, for example, they can be searched based on the average rating for all related records.

Review

A link between a review and a learning material record is made based on the hreview:info field in the SMO and the first of the object identifiers in the learning material record. After the 2021-11 release, a match is no longer limited to the first record identifier but can be made on any of the record identifiers. This also means that a review can be linked to a record that has no direct match with the original review. For example, a review that points to id:1 is linked to record A and, via the shared id:2, can also be linked to record B:

  • record A
    • id:1
    • id:2
  • record B
    • id:2
    • id:3
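
The example can be worked through in code. The sketch below assumes that the link extends exactly one step: from the records that directly contain the reviewed identifier to every record sharing any identifier with them. Whether Edurep follows longer chains is not stated, and the function name and data layout are hypothetical.

```python
def linked_records(review_target, records):
    """Return the names of all records a review on review_target links to."""
    # Records that contain the reviewed identifier directly.
    direct = [name for name, ids in records.items() if review_target in ids]
    # All identifiers carried by those records.
    shared_ids = {identifier for name in direct for identifier in records[name]}
    # Any record sharing one of those identifiers is linked as well.
    return sorted(name for name, ids in records.items() if shared_ids & set(ids))


RECORDS = {
    "record A": ["id:1", "id:2"],
    "record B": ["id:2", "id:3"],
}
print(linked_records("id:1", RECORDS))
```

A review on id:1 matches record A directly and reaches record B through the shared id:2.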