Developer-faq
From RDFaWiki
Revision as of 16:15, 17 February 2009
Search Questions
Would a search system based on RDFa give better results?
Would a search system based on RDF or RDFa give a better answer to searches done in Google (in 2009)? How? Does it require all data to be marked up as RDFa?
Which is better, RDFa or Natural Language Processing?
Can an RDF/RDFa system do better from a natural language query?
Will authors widely and reliably use RDFa?
Do we have reason to believe that it is more likely that we will get authors to widely and reliably include such relations than it is that we will get high quality natural language processing? Why?
How does RDFa deal with unstructured natural language search queries?
How would an RDF/RDFa system deal with the problem of the _questions_ being unstructured natural language?
Isn't an HREF good enough for expressing links between concepts on the web?
Isn't an <a href=""> suitable for this already?
Data Sharing
How does RDFa work with companies that publish their data in non-RDFa formats?
How would an RDF/RDFa system deal with data provided by companies that have no interest in providing the data in RDF or RDFa? (e.g. companies providing data dumps in XML or JSON.)
How does RDFa work with companies that don't want to provide their data for free?
How would an RDF/RDFa system deal with companies that do not want to provide the data free of charge?
How does RDFa deal with authors screwing up and encoding bad data?
Like most technologies, RDFa in and of itself is incapable of preventing misuse. The same pitfalls hold true for authoring HTML, JSON, or even e-mail and articles created using any written language. People make spelling and grammatical mistakes quite often, and there are systems and tools to detect these mistakes and sometimes correct them.
RDFa is not much different from the English language. Basic preventative measures are built into the language, such as not generating triples when a prefix is unknown or when a triple is malformed. However, it is fairly difficult to prevent authors from eventually making mistakes. Tools such as Fuzzbot already allow web page authors to see the triples and data that they mark up.
If data is malformed, the UIs that use that data, such as Fuzzbot, will clearly show information that is not what the author intended. This method of seeing an error gives them an opportunity to correct the error.
Authoring tools are also planned that will reduce the number of hand-authoring errors in content management systems, such as Drupal and Wordpress.
How does RDFa deal with apathy from sites that you want to scrape data from?
How does RDFa deal with apathy from sites that you want to scrape data from?
How does RDFa deal with spammers or other malicious authors encoding misleading data?
How does RDFa deal with spammers or other malicious authors encoding misleading data?
How does RDFa enable monetization for producers who are intentionally obfuscating the data today?
How does RDFa enable monetization for producers who are intentionally obfuscating the data today?
How does RDFa track per-developer usage of their data?
How would an RDF/RDFa system deal with companies that want to track per-developer usage of their data?
How is RDFa going to help sites like Wikipedia?
How is RDFa going to make the thousands or millions of Wikipedia contributors faster?
Doesn't RDFa create invisible meta-data and isn't that a bad idea?
Doesn't RDFa create invisible meta-data and isn't that a bad idea?
Process
The RDFa Task Force was only chartered to solve the metadata problem in XHTML, so why bother with HTML4 and HTML5?
The RDFa Task Force was only chartered to solve the metadata problem in XHTML, so why bother with HTML4 and HTML5?
HTML and XHTML Differences
HTML is parsed differently than XHTML, is it possible to write one RDFa parser to parse both XHTML and HTML?
HTML is parsed differently than XHTML, is it possible to write one RDFa parser to parse both XHTML and HTML?
QNames have been identified as a known anti-pattern, does RDFa revive QName use?
QNames have been identified as a known anti-pattern, does RDFa revive QName use?
It is a common misconception that RDFa uses QNames.
RDFa does not use QNames. The specification has defined the CURIE datatype with explicit parsing rules, and it has been specifically defined as not mapping to (namespace,local), but instead to a full URL. RDFa does not use a browser's handling of QNames, and whatever brokenness that might exist with QNames doesn't apply to CURIEs or RDFa.
We use the xmlns declarations only to map prefixes to URLs; QName expansion is never performed.
RDFa does not use QNames.
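As a rough illustration of the difference, the sketch below expands a CURIE by plain string concatenation against a prefix map, producing a single URL rather than a (namespace, local) pair. The function name expand_curie and the prefix map are hypothetical; this is not code from any RDFa parser.

```python
def expand_curie(curie, prefixes):
    """Expand 'prefix:reference' to a full URL by string concatenation.

    Returns None when the prefix is not declared, in which case no
    triple would be generated for it.
    """
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        return None
    # The result is one URL string, not a (namespace, local) pair.
    return prefixes[prefix] + reference


# Prefix mappings as they might be collected from xmlns: declarations.
prefixes = {
    "dcterms": "http://purl.org/dc/terms/",
    "audio": "http://purl.org/media/audio#",
}

print(expand_curie("dcterms:title", prefixes))
print(expand_curie("audio:Recording", prefixes))
print(expand_curie("foaf:name", prefixes))  # undeclared prefix
```

Because the reference is only appended to the prefix URL, it does not have to be a valid XML name. A reference that starts with a digit, for example, is acceptable in a CURIE but not as the local part of a QName.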
What about the conflict between HTML5 and RDFa with the use of CURIEs in the @rel attribute?
What about the conflict between HTML5 and RDFa with the use of CURIEs in the @rel attribute?
Authoring
Why does RDFa use CURIEs?
RDFa uses CURIEs for the following reasons:
- It eases the cognitive load for the web developer.
- It reduces clutter and eases readability of the HTML code.
- It reduces URL errors introduced by typing out complete URLs.
- It reduces the size of HTML files that contain a large number of RDFa statements.
Easing the Cognitive Load for the Web Developer
Instead of writing out full URLs for predicates (e.g. http://purl.org/media/audio#Recording), the use of CURIEs allows the author to write something easier to remember (e.g. "audio:Recording"). This reduces the cognitive load on the author if they are writing RDFa by hand. If they are not writing RDFa by hand, the authoring argument is a non-issue.
Reduces Clutter and Eases Readability of HTML Code
Using CURIEs reduces clutter and eases readability of HTML code by reducing the number of URLs that are placed in the HTML document. While this may seem like a small improvement, it certainly does help those that are debugging HTML code to not have to worry about checking every character in every URL that is used as a predicate in RDFa.
Reducing URL errors introduced by typing out complete URLs
The probability that a typing error will occur when repeatedly typing predicate URLs rises significantly as the document size grows. While the possibility still exists when typing a string like 'dcterms:title' or 'audio:Recording', it is less than when typing a URL like http://purl.org/dc/terms/title or http://purl.org/media/audio#Recording.
Reducing the size of HTML files that contain a large number of RDFa statements
While usually a minor issue, page size does matter to larger sites. This was a concern when developing RDFa, and a nice side effect of CURIEs is that they reduce page size in almost all usage scenarios. For example, if we are marking up 20 audio recordings on a single web page, each with 3 predicates (type, title and singer), we will need to specify
20 * 3 == 60
sixty predicates. With CURIEs, this results in
len("http://purl.org/media/audio#") + len("audio:")*60
388 characters used to express the predicates. Without CURIEs, this results in
len("http://purl.org/media/audio#")*60
1680 characters used to express the predicates. Using CURIEs results in a roughly 4x reduction in characters used.
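The same count can be reproduced in a few lines of Python. This is only a sketch of the arithmetic: it counts the namespace URL and the "audio:" prefix, and leaves out the local names (which appear in both forms) and the xmlns attribute wrapper. The variable names are chosen for illustration.

```python
NAMESPACE = "http://purl.org/media/audio#"
PREFIX = "audio:"

recordings = 20
predicates_per_recording = 3  # type, title and singer
uses = recordings * predicates_per_recording

# With CURIEs: the namespace URL appears once, the short prefix on every use.
with_curies = len(NAMESPACE) + len(PREFIX) * uses
# Without CURIEs: the namespace URL is repeated on every use.
without_curies = len(NAMESPACE) * uses

print(uses)            # 60
print(with_curies)     # 388
print(without_curies)  # 1680
print(round(without_curies / with_curies, 1))  # 4.3
```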
What are the draw-backs of using CURIEs?
The most prevalent arguments against the use of CURIEs are:
- If the ratio of the number of vocabularies used to triples generated approaches 1, the HTML file is larger than if no CURIEs were used.
- They cause HTML markup to be fragile under copy-paste scenarios.
- Prefixes are difficult to teach and understand.
CURIEs bloat HTML files
While it has been demonstrated that CURIEs can offer a roughly 4x reduction in character usage, it is true that a prefix that is used only once wastes a number of characters. This is because the CURIE prefix must first be defined and then used. For example, if you were to specify just the title of a page using the dcterms vocabulary, you would use:
len("xmlns:dcterms='http://purl.org/dc/terms/'") + len("dcterms:title") - len("http://purl.org/dc/terms/title")
24 extra characters (41 + 13 - 30). Using the prefix twice still costs
len("xmlns:dcterms='http://purl.org/dc/terms/'") + len("dcterms:title")*2 - len("http://purl.org/dc/terms/title")*2
7 extra characters (41 + 26 - 60). Each use saves 17 characters against the 41-character declaration, so the prefix pays for itself from the third use onward, where it saves 10 characters. Most RDFa markup uses each prefix often enough to benefit from the small up-front cost of defining it.
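The trade-off can be checked by counting characters directly. The sketch below assumes the prefix is declared once as xmlns:dcterms='http://purl.org/dc/terms/' and that every use is written as the full CURIE dcterms:title in place of the full URL; other terms or a different declaration would shift the numbers. The helper extra_characters is hypothetical.

```python
import math

DECLARATION = "xmlns:dcterms='http://purl.org/dc/terms/'"
CURIE = "dcterms:title"
FULL_URL = "http://purl.org/dc/terms/title"


def extra_characters(uses):
    """Characters added (positive) or saved (negative) by using the CURIE."""
    return len(DECLARATION) + len(CURIE) * uses - len(FULL_URL) * uses


# Each use replaces the full URL with the shorter CURIE.
saving_per_use = len(FULL_URL) - len(CURIE)
# Smallest number of uses at which the declaration has been paid for.
break_even_uses = math.ceil(len(DECLARATION) / saving_per_use)

for uses in (1, 2, 3):
    print(uses, extra_characters(uses))
print("break-even at", break_even_uses, "uses")
```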
CURIEs make RDFa markup fragile
The most prevalent argument against CURIEs is that they cause page markup to be fragile. If one copies HTML from one website to another and forgets to copy the prefix definitions for the CURIEs (either by mistake or because they didn't know), then any triples that use those unknown prefixes will stop working. While this is true for cut-and-paste scenarios, it does not hold at all for authoring tools and content management systems, which take care of defining the prefixes for the author. The alternative of not using CURIEs was explored, but CURIEs provided too much benefit to ignore.
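The copy-and-paste failure can be pictured with a small model. The sketch below reduces a snippet of markup to its (property, value) pairs and assumes that a triple is simply not generated when its prefix is undeclared. The function triples_for, the subject "#song", and the sample values are all made up for illustration.

```python
def triples_for(subject, properties, prefixes):
    """Yield (subject, predicate URL, value), skipping undeclared prefixes."""
    for curie, value in properties:
        prefix, _, reference = curie.partition(":")
        if prefix in prefixes:
            yield (subject, prefixes[prefix] + reference, value)


# The same copied markup, reduced to its (property, value) pairs.
copied_properties = [
    ("dcterms:title", "Example Song"),
    ("dcterms:creator", "Example Artist"),
]

# Original page: the prefix declaration is in scope.
original = list(triples_for("#song", copied_properties,
                            {"dcterms": "http://purl.org/dc/terms/"}))
# Destination page: the snippet was pasted without its xmlns declaration.
pasted = list(triples_for("#song", copied_properties, {}))

print(len(original))  # 2
print(len(pasted))    # 0
```

The pasted markup still renders the same for a human reader; only the machine-readable triples are lost.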
Defining Namespaces and Prefixes are Difficult to Teach and Understand
It has been asserted that prefixes, namespaces and non-URI structures are difficult to teach and understand. This is, however, hard to prove as many people use namespaces and prefixes in their everyday lives. http: is a namespace, as is a person's last name. Often it is the method of teaching that is lacking and not the strength of the concept.
Security
Are iframes a security risk to RDFa?
If data in iframes are processed and digital signatures are not used for the data on a page then iframes are indeed a "security risk". The issue is that another site could hijack the data on a page by re-writing or overwriting triple URLs that are defined on a page. For example, an advertisement loaded through an iframe could inject triples for their product/service into the page content that you are viewing. This could manifest itself as a link to the latest Beyonce CD, which is actually a link to the latest Viagra ad.
There are several proposed solutions to this issue:
- Do not process any data contained in an iframe.
- Only process iframe data that is digitally signed by a trusted party.
Do not process any data contained in an iframe
A security option setting could be to ignore all triples contained in iframe data. The option could be enabled when calling the RDFa parser. This would remove the threat entirely, but could result in blocking some interesting uses of RDFa.
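One way such an option might look is sketched below. The page is modelled as a small tree of documents so that the example stays self-contained; the Document class, the collect_triples function, and its process_iframes flag are hypothetical and do not correspond to any existing RDFa parser API.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A page's own triples plus the documents it embeds through iframes."""
    url: str
    triples: list
    iframes: list = field(default_factory=list)


def collect_triples(document, process_iframes=False):
    """Return the page's triples, descending into iframes only if asked."""
    collected = list(document.triples)
    if process_iframes:
        for frame in document.iframes:
            collected.extend(collect_triples(frame, process_iframes=True))
    return collected


ad = Document(
    url="http://ads.example/frame",
    triples=[("http://shop.example/cd", "http://purl.org/dc/terms/title", "Not the CD")],
)
page = Document(
    url="http://shop.example/",
    triples=[("http://shop.example/cd", "http://purl.org/dc/terms/title", "The CD")],
    iframes=[ad],
)

print(len(collect_triples(page)))                        # 1
print(len(collect_triples(page, process_iframes=True)))  # 2
```

Defaulting the flag to False makes the safe behaviour the one a caller gets without asking for it.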
Only process digitally signed data
Digital signatures are expected to become the primary method of verifying the trustworthiness of statements made on a page. It is important that trusted statements are given greater consideration by a browser viewing a web page. This means that one alternative is to digitally sign every triple, or sign a bundle of triples, so that a browser can differentiate between data that carries a high level of trust and data that carries no assurances. Standardized digital signature technology can be used for these purposes.
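The idea of signing a bundle of triples can be sketched with the Python standard library. Note the assumptions: the example uses HMAC-SHA256 with a shared secret as a stand-in, which is a message authentication code and not a public-key digital signature, and the tab-and-newline serialization is a naive placeholder for a real canonicalization scheme. All names and values are illustrative.

```python
import hashlib
import hmac


def canonical_bytes(triples):
    """Serialize a bundle of triples in a stable, order-independent form."""
    lines = sorted("\t".join(triple) for triple in triples)
    return "\n".join(lines).encode("utf-8")


def sign_bundle(triples, key):
    """Compute a keyed digest over the whole bundle (HMAC stand-in)."""
    return hmac.new(key, canonical_bytes(triples), hashlib.sha256).hexdigest()


def is_trusted(triples, signature, key):
    """Check that the bundle still matches the digest it was published with."""
    expected = sign_bundle(triples, key)
    return hmac.compare_digest(expected, signature)


key = b"shared-secret-for-illustration-only"
bundle = [
    ("http://shop.example/cd", "http://purl.org/dc/terms/title", "The CD"),
    ("http://shop.example/cd", "http://purl.org/dc/terms/creator", "The Artist"),
]
signature = sign_bundle(bundle, key)

# A triple injected after signing invalidates the bundle.
tampered = bundle + [("http://shop.example/cd", "http://purl.org/dc/terms/title", "An Ad")]

print(is_trusted(bundle, signature, key))
print(is_trusted(tampered, signature, key))
```

Signing a bundle rather than each triple keeps the overhead to one signature per page or per block of statements, at the cost of rejecting the whole bundle when any single triple changes.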
How does one prevent bad triples from corrupting a local triple store?
Like other things on the web, there will be certain data sources that you trust and certain ones that you do not trust. If long-term triple storage is a goal for an individual, the browser can shield them from bad data sources by only including data from the following sources:
- Digitally signed data sources
- Non-blacklisted data sources
- White-listed data sources
Using this mechanism would allow browsers to clean and protect the triple store without user intervention, either on their own or by consulting an externally trusted blacklist or whitelist service, similar to spam blacklisting services.
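The text does not say how the three checks combine, so the sketch below shows just one possible ordering: a blacklisted host is always rejected, a whitelisted host is always accepted, and any other source is accepted only if its data is signed. The function accept_source and the example hosts are hypothetical.

```python
from urllib.parse import urlparse


def accept_source(url, signed, blacklist, whitelist):
    """Decide whether triples from `url` may enter the local store."""
    host = urlparse(url).hostname
    if host in blacklist:
        return False   # blacklisted sources are always rejected
    if host in whitelist:
        return True    # explicitly trusted sources are accepted
    return signed      # anything else must carry a valid signature


blacklist = {"spam.example"}
whitelist = {"library.example"}

print(accept_source("http://spam.example/page", True, blacklist, whitelist))
print(accept_source("http://library.example/book", False, blacklist, whitelist))
print(accept_source("http://unknown.example/", False, blacklist, whitelist))
print(accept_source("http://unknown.example/", True, blacklist, whitelist))
```

The blacklist and whitelist sets could equally be loaded from an external service, in the same way spam blacklists are consumed today.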