Message ID | pull.1715.git.git.1714343461.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | builtin: implement, document and test url-parse | expand |
On Sun, Apr 28, 2024 at 10:30:48PM +0000, Matheus Moreira via GitGitGadget wrote: > Git commands accept a wide variety of URLs syntaxes, not just standard URLs. > This can make parsing git URLs difficult since standard URL parsers cannot > be used. Even if an external parser were implemented, it would have to track > git's development closely in case support for any new URL schemes are added. > > These patches introduce a new url-parse builtin command that exposes git's > native URL parsing algorithms as a plumbing command, allowing other programs > to then call upon git itself to parse the git URLs and their components. > > This should be quite useful for scripts. For example, a script might want to > add remotes to repositories, naming them according to the domain name where > the repository is hosted. This new builtin allows it to parse the git URL > and extract its host name which can then be used as input for other > operations. This would be difficult to implement otherwise due to git's > support for scp style URLs. > All in all, having a URL parser as such is a good thing, thanks for working on that. There are, however, some notes and questions, up for discussion: - are there any plans to integrate the parser into connect.c and fetch ? Speaking as a person, who manage to break the parsing of URLs once, with the good intention to improve things, I need to learn that test cases are important. Some work can be seen in t5601-clone.sh Especially, when dealing with literal IPv6 addresses, the ones with [] and the simplified ssh syntax 'myhost:src' are interesting to test. Git itself strives to be RFC compliant when parsing URLs, but we do not fully guarantee to be "fully certified". And some features using the [] syntax to embedd a port number inside the simplified ssh syntax had not been documented, but used in practise, and are now part of the test suite. See "[myhost:123]:src" in t5601 - Or is this new tool just a helper, to verify "good" URL's, and not accepting our legacy parser quirks ? Then we still should see some IPv6 tests ? Or may be not, as we prefer hostnames these days ? - One minor comment: in 02/13 we read: +enum protocol { + PROTO_UNKNOWN = 0, + PROTO_LOCAL, + PROTO_FILE, + PROTO_SSH, + PROTO_GIT, The RFC 1738 uses the term "scheme" here, and using the very generic term "protocol" may lead to name clashes later. Would something like "git_scheme" or so be better ? - One minor comment: In 13/13 we read: + git url-parse "file:///" && + git url-parse "file://" I think that the "///" version is superflous, it should already be covered by the "//" version
Thank you for your feedback. > are there any plans to integrate the parser into connect.c and fetch ? Yes. That was my intention but I was not confident enough to touch connect.c before getting feedback from the community, since it's critical code and it is my first contribution. I do want to merge all URL parsing in git into this one function though, thereby creating a "single point of truth". This is so that if the algorithm is modified the changes are visible to the URL parser builtin as well. > Speaking as a person, who manage to break the parsing of URLs once, > with the good intention to improve things, I need to learn that > test cases are important. Absolutely agree. When adding test cases, I looked at the possibilities enumerated in urls.txt and generated test cases based on those. I also looked at the urlmatch.h test cases. However... > Some work can be seen in t5601-clone.sh ... I did not think to check those. > Especially, when dealing with literal IPv6 addresses, > the ones with [] and the simplified ssh syntax 'myhost:src' > are interesting to test. You're right about that. I shall prepare an updated v2 patchset with more test cases, and also any other changes/improvements requested by maintainers. > And some features using the [] syntax to embedd a port number > inside the simplified ssh syntax had not been documented, > but used in practise, and are now part of the test suite. > See "[myhost:123]:src" in t5601 Indeed, I did not read anything of the sort when I checked it. Would you like me to commit a note to this effect to urls.txt ? > Or is this new tool just a helper, to verify "good" URL's, > and not accepting our legacy parser quirks ? It is my intention that this builtin be able to accept, parse and decompose all types of URLs that git itself can accept. > Then we still should see some IPv6 tests ? I will add them! > Or may be not, as we prefer hostnames these days ? I would have to defer that choice to someone more experienced with the codebase. Please advise on how to proceed. > The RFC 1738 uses the term "scheme" here, and using the very generic > term "protocol" may lead to name clashes later. > Would something like "git_scheme" or so be better ? Scheme does seem like a better word if it's the terminology used by RFCs. I can change that in a new version if necessary. That code is based on the existing connect.c parsing code though. > I think that the "///" version is superflous, it should already > be covered by the "//" version I thought it was a good idea because of existing precedent: my first approach to creating the test cases was to copy the ones from t0110-urlmatch-normalization.sh which did have many cases such as those. Then as I developed the code I came to believe that it was not necessary: I call url_normalize in the url_parse function and url_normalize is already being tested. I think I just forgot to delete those lines. Reading that file over once again, it does have IPv6 address test cases. So I should probably go over it again. Thanks again for the feedback, Matheus
On Mon, Apr 29, 2024 at 07:04:40PM -0300, Matheus Afonso Martins Moreira wrote: > Thank you for your feedback. > > > are there any plans to integrate the parser into connect.c and fetch ? > > Yes. > > That was my intention but I was not confident enough to touch connect.c > before getting feedback from the community, since it's critical code > and it is my first contribution. Welcome to the Git community. I wasn't aware of t0110 as a test case... > > I do want to merge all URL parsing in git into this one function though, > thereby creating a "single point of truth". This is so that if the algorithm > is modified the changes are visible to the URL parser builtin as well. > That is a good thing to do. Be prepared for a longer journey, since we have this legacy stuff to deal with. But I am happy to help with reviews, even if that may take some days, [] > When adding test cases, I looked at the possibilities enumerated in urls.txt > and generated test cases based on those. I also looked at the urlmatch.h > test cases. However... > > > Some work can be seen in t5601-clone.sh > > ... I did not think to check those. > > > Especially, when dealing with literal IPv6 addresses, > > the ones with [] and the simplified ssh syntax 'myhost:src' > > are interesting to test. > > You're right about that. I shall prepare an updated v2 patchset > with more test cases, and also any other changes/improvements > requested by maintainers. > > > And some features using the [] syntax to embedd a port number > > inside the simplified ssh syntax had not been documented, > > but used in practise, and are now part of the test suite. > > See "[myhost:123]:src" in t5601 > > Indeed, I did not read anything of the sort when I checked it. > Would you like me to commit a note to this effect to urls.txt ? On short: please not. This kind of syntax was never ment to be used. The official "ssh://myhost:123/src" is recommended. When IPv6 parsing was added, people discovered that it could be used to "protect" the ':' from being a seperator between the hostname and the path, and can be used to seperate the hostname from the port. Once that was used in real live, it was too late to change it. If we now get a better debug tool, it could mention that this is a legacy feature, and recommend the longer "ssh://" syntax. > > > Or is this new tool just a helper, to verify "good" URL's, > > and not accepting our legacy parser quirks ? > > It is my intention that this builtin be able to accept, parse > and decompose all types of URLs that git itself can accept. > > > Then we still should see some IPv6 tests ? > > I will add them! > > > Or may be not, as we prefer hostnames these days ? > > I would have to defer that choice to someone more experienced > with the codebase. Please advise on how to proceed. Re-reading this email conversation, I think that we should support (in the future), what we support today. Having a new parser tool means, that there is a chance to reject those URLs with the note/hint, that they are depracted, and should be replaced by a proper one. From my point of view this means that all existing test case should pass even with the new parser, as a general approach. Deprecating things is hard, may take years, and may be done in a seperate task/patch series. Or may be part of this one, in seperate commits. > > > The RFC 1738 uses the term "scheme" here, and using the very generic > > term "protocol" may lead to name clashes later. > > Would something like "git_scheme" or so be better ? > > Scheme does seem like a better word if it's the terminology used by RFCs. > I can change that in a new version if necessary. > That code is based on the existing connect.c parsing code though. > > > I think that the "///" version is superflous, it should already > > be covered by the "//" version > > I thought it was a good idea because of existing precedent: > my first approach to creating the test cases was to copy the > ones from t0110-urlmatch-normalization.sh which did have many > cases such as those. Then as I developed the code I came to > believe that it was not necessary: I call url_normalize > in the url_parse function and url_normalize is already being > tested. I think I just forgot to delete those lines. > > Reading that file over once again, it does have IPv6 address > test cases. So I should probably go over it again. > > Thanks again for the feedback, > > Matheus >
Git commands accept a wide variety of URLs syntaxes, not just standard URLs. This can make parsing git URLs difficult since standard URL parsers cannot be used. Even if an external parser were implemented, it would have to track git's development closely in case support for any new URL schemes are added. These patches introduce a new url-parse builtin command that exposes git's native URL parsing algorithms as a plumbing command, allowing other programs to then call upon git itself to parse the git URLs and their components. This should be quite useful for scripts. For example, a script might want to add remotes to repositories, naming them according to the domain name where the repository is hosted. This new builtin allows it to parse the git URL and extract its host name which can then be used as input for other operations. This would be difficult to implement otherwise due to git's support for scp style URLs. Signed-off-by: Matheus Afonso Martins Moreira matheus@matheusmoreira.com Matheus Afonso Martins Moreira (13): url: move helper function to URL header and source urlmatch: define url_parse function builtin: create url-parse command url-parse: add URL parsing helper function url-parse: enumerate possible URL components url-parse: define component extraction helper fn url-parse: define string to component converter fn url-parse: define usage and options url-parse: parse options given on the command line url-parse: validate all given git URLs url-parse: output URL components selected by user Documentation: describe the url-parse builtin tests: add tests for the new url-parse builtin .gitignore | 1 + Documentation/git-url-parse.txt | 59 ++++++++++ Makefile | 1 + builtin.h | 1 + builtin/url-parse.c | 132 ++++++++++++++++++++++ command-list.txt | 1 + connect.c | 8 -- connect.h | 1 - git.c | 1 + remote.c | 1 + t/t9904-url-parse.sh | 194 ++++++++++++++++++++++++++++++++ url.c | 8 ++ url.h | 2 + urlmatch.c | 90 +++++++++++++++ urlmatch.h | 1 + 15 files changed, 492 insertions(+), 9 deletions(-) create mode 100644 Documentation/git-url-parse.txt create mode 100644 builtin/url-parse.c create mode 100755 t/t9904-url-parse.sh base-commit: e326e520101dcf43a0499c3adc2df7eca30add2d Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1715%2Fmatheusmoreira%2Furl-parse-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1715/matheusmoreira/url-parse-v1 Pull-Request: https://github.com/git/git/pull/1715