View previous topic :: View next topic |
Author |
Message |
amandini n00b
Joined: 01 Jul 2020 Posts: 27
|
Posted: Mon Nov 18, 2024 6:25 am Post subject: Segentation fault with Eigen library and GMP rationals |
|
|
Hello,
I am doing some work using the Eigen C++ template library, with large sparse matrices, say 200,000 x 200,000, over the mpq_class provided by the GNU Multiple Precision (GMP) library.
The first problem is that code which compiles without any warnings on the godbolt compiler, a small example of which is given at https://godbolt.org/z/j66jdK7n6, produces the following warnings when compiled on my gentoo system
Code: |
#g++ -std=c++20 -O2 -march=native -lgmp -lgmpxx -Iboost -Wall -Wpedantic -Wextra example.cpp -o example.o
In file included from /usr/include/eigen3/Eigen/SparseCore:40,
from example.cpp:6:
/usr/include/eigen3/Eigen/src/SparseCore/AmbiVector.h: In instantiation of ‘void Eigen::internal::AmbiVector<_Scalar, _StorageIndex>::reallocateSparse() [with _Scalar = __gmp_expr<__mpq_struct [1], __mpq_struct [1]>; _StorageIndex = int]’:
/usr/include/eigen3/Eigen/src/SparseCore/AmbiVector.h:238:11: required from ‘_Scalar& Eigen::internal::AmbiVector<_Scalar, _StorageIndex>::coeffRef(Eigen::Index) [with _Scalar = __gmp_expr<__mpq_struct [1], __mpq_struct [1]>; _StorageIndex = int; Eigen::Index = long int]’
/usr/include/eigen3/Eigen/src/SparseCore/TriangularSolver.h:233:28: required from ‘static void Eigen::internal::sparse_solve_triangular_sparse_selector<Lhs, Rhs, Mode, UpLo, 0>::run(const Lhs&, Rhs&) [with Lhs = const Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]> >; Rhs = Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]> >; int Mode = 6; int UpLo = 2]’
/usr/include/eigen3/Eigen/src/SparseCore/TriangularSolver.h:306:93: required from ‘void Eigen::TriangularViewImpl<MatrixType, Mode, Eigen::Sparse>::solveInPlace(Eigen::SparseMatrixBase<OtherDerived>&) const [with OtherDerived = Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]> >; MatrixType = const Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]> >; unsigned int Mode = 6]’
example.cpp:80:61: required from here
/usr/include/eigen3/Eigen/src/SparseCore/AmbiVector.h:97:18: warning: ‘void* memcpy(void*, const void*, size_t)’ writing to an object of type ‘Eigen::internal::AmbiVector<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int>::Scalar’ {aka ‘class __gmp_expr<__mpq_struct [1], __mpq_struct [1]>’} with no trivial copy-assignment; use copy-assignment or copy-initialization instead [-Wclass-memaccess]
97 | std::memcpy(newBuffer, m_buffer, copyElements * sizeof(ListEl));
| ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from example.cpp:4:
/usr/include/gmpxx.h:1763:7: note: ‘Eigen::internal::AmbiVector<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int>::Scalar’ {aka ‘class __gmp_expr<__mpq_struct [1], __mpq_struct [1]>’} declared here
1763 | class __gmp_expr<mpq_t, mpq_t>
| ^~~~~~~~~~~~~~~~~~~~~~~~
|
Eigen developers informed me that they could not reproduce these warnings on their machines. Nor are the warnings reproducible on the compiler explorer https://godbolt.org/z/j66jdK7n6.
Second and more pressing issue is that the process of constructing large sparse matrices (175,175 x 175,175) over mpq_class in Eigen causes the following segmentation fault:
Code: | Thread 1 "main.o" received signal SIGSEGV, Segmentation fault.
0x00007ffff79d525e in free () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff79d525e in free () from /lib64/libc.so.6
#1 0x00007ffff7edee80 in __gmpq_clear () from /usr/lib64/libgmp.so.10
#2 0x000055555555ff4d in __gmp_expr<__mpq_struct [1], __mpq_struct [1]>::~__gmp_expr (this=0x7fffffffd510, __in_chrg=<optimized out>)
at /usr/include/gmpxx.h:1855
#3 Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, 1, int>::insertBackUncompressed (col=<optimized out>, row=<optimized out>, this=0x7fffffffd4c0) at /usr/include/eigen3/Eigen/src/SparseCore/SparseMatrix.h:904
#4 Eigen::internal::set_from_triplets<__gnu_cxx::__normal_iterator<Eigen::Triplet<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int> const*, std::vector<Eigen::Triplet<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int>, std::allocator<Eigen::Triplet<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int> > > >, Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, 0, int>, Eigen::internal::scalar_sum_op<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, __gmp_expr<__mpq_struct [1], __mpq_struct [1]> > > (begin=..., end=..., mat=..., dup_func=...) at /usr/include/eigen3/Eigen/src/SparseCore/SparseMatrix.h:1056
#5 0x000055555556cf22 in Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, 0, int>::setFromTriplets<__gnu_cxx::__normal_iterator<Eigen::Triplet<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int> const*, std::vector<Eigen::Triplet<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int>, std::allocator<Eigen::Triplet<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int> > > > > (end=..., begin=..., this=0x7fffffffdb78) at /usr/include/eigen3/Eigen/src/Core/functors/BinaryFunctors.h:36
#6 p_basis<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, 0, int> >::basis_builder (this=this@entry=0x7fffffffdb38, src=..., seminormal_form=..., p=p@entry=3) at /home/user_name/path_to_file/p_basis.h:165
#7 0x0000555555570c69 in p_basis<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, 0, int > >::p_basis (this=0x7fffffffdb38, src=..., seminormal_form=..., p=3) at /home/user_name/path_to_file/p_basis.h:57
#8 0x00005555555718c5 in basis_data<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, 0, int> >::basis_data (p=3, src=..., this=0x7fffffffdaf8) at /home/user_name/path_to_file/basis_data.h:61
#9 mform<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, 0, int> >::mform (this=0x7fffffffdaf0, src=..., path="/home/user_name/output.txt", p=3) at /home/user_name/path_to_file/mform.h:36
#10 0x0000555555559995 in main () at /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/bits/basic_string.tcc:242
(gdb) |
The code that produced the segmentation fault above, was
Code: | typedef Eigen::Triplet<mpq_class> qtriplet;
std::vector<qtriplet> triplets{};
|
and line 165 of the file p_basis.h referred to in the gdb backtrace aboveis
Code: | m_basis.setFromTriplets(std::cbegin(triplets), std::cend(triplets)); |
Would anyone at the Gentoo Science Project (Eigen) or the Gentoo Toolchain Project (GMP) be willing or able to help look into these issues?
Any other suggestions that may help get to the bottom of this issue would be gratefully received.
[edited for typos]
Last edited by amandini on Mon Nov 18, 2024 8:25 am; edited 1 time in total |
|
Back to top |
|
|
sam_ Developer
Joined: 14 Aug 2020 Posts: 1972
|
Posted: Mon Nov 18, 2024 6:37 am Post subject: |
|
|
I see the warning locally as well but not on godbolt. Not sure why yet.
What does -march=native expand to on your system? You can check by running resolve-march-native from app-misc/resolve-march-native. I do see it without but it may be useful to know nonetheless.
Could you give me a testcase which segfaults? Can look more then. |
|
Back to top |
|
|
amandini n00b
Joined: 01 Jul 2020 Posts: 27
|
Posted: Mon Nov 18, 2024 7:05 am Post subject: |
|
|
sam_ wrote: | What does -march=native expand to on your system? You can check by running resolve-march-native from app-misc/resolve-march-native.
I see the warning locally as well but not on godbolt. Not sure why yet.
Could you give me a testcase which segfaults? |
Thank you for your reply. I run these calculations on two workstations, on one -march=native resolves as
Code: | -march=tigerlake -mabm -mno-kl -mno-sgx -mno-widekl -mshstk --param=l1-cache-line-size=64 --param=l1-cache-size=48 --param=l2-cache-size=24576 |
and the other it resolves as
Code: | march=znver2 --param=l1-cache-line-size=64 --param=l1-cache-size=32 --param=l2-cache-size=512 |
The compiler warnings occur identically on both the znver2 and the tigerlake machine.
The example on godbolt was only contrived to illustrate the compiler warnings. As it is, it took many lines of code and over 500GB of ram on the AMD machine to produce the segmentation fault, so it may not be practical to share this specific example.
I have not tried to reproduce the segmentation fault on the intel machine which has less memory.
However, it may be possible to come up with a contrived and simpler example that reproduces the segmentation fault. Unfortunately may take a bit of time to get such a segmentation fault example working - my apologies in advance.
Should I go ahead and try to make such an example?
Last edited by amandini on Tue Nov 19, 2024 12:33 am; edited 1 time in total |
|
Back to top |
|
|
bstaletic Guru
Joined: 05 Apr 2014 Posts: 374
|
Posted: Mon Nov 18, 2024 8:02 am Post subject: |
|
|
I'm not familiar with GMP, but that operator< looks very wrong - sometimes it tells you which of the two inputs is smaller and sometimes it does the opposite.
Code: |
x y | result
---------------
0 0 | false
0 y | true
x 0 | false
x 2x | false
2x x | true
|
The last two return true if x is greater than y.
The first three return true if y is greater than x. |
|
Back to top |
|
|
GDH-gentoo Veteran
Joined: 20 Jul 2019 Posts: 1709 Location: South America
|
Posted: Mon Nov 18, 2024 8:05 pm Post subject: |
|
|
I don't know why the Compiler Explorer doesn't output the warning, but it makes sense. The code in Eigen::internal::AmbiVector<>'s reallocateSparse() member function uses std::memcpy() to copy the value of an object whose type is the type denoted by the class template's _Scalar parameter. I believe that you can have undefined behaviour when using objects of nontrivially copyable types made like that... and GMP's mpq_class is definitely not trivially copyable, it's a class that has nontrivial default and copy constructors, a nontrivial copy assignment operator, and a nontrivial destructor.
I don't know in general if Eigen documents its expectations about the types that are passed as template arguments like this, examples seem to show only arithmetic types like double or at most trivially copyable classes like std::complex<float>. The segfault shown in the backtrace happens in the destructor of some mpq_class object, hard to say if it was properly constructed. _________________
NeddySeagoon wrote: | I'm not a witch, I'm a retired electronics engineer |
Ionen wrote: | As a packager I just don't want things to get messier with weird build systems and multiple toolchains requirements though |
|
|
Back to top |
|
|
amandini n00b
Joined: 01 Jul 2020 Posts: 27
|
Posted: Mon Nov 18, 2024 10:41 pm Post subject: |
|
|
GDH-gentoo wrote: | I don't know why the Compiler Explorer doesn't output the warning, but it makes sense. The code in Eigen::internal::AmbiVector<>'s reallocateSparse() member function uses std::memcpy() to copy the value of an object whose type is the type denoted by the class template's _Scalar parameter. I believe that you can have undefined behaviour when using objects of nontrivially copyable types made like that... and GMP's mpq_class is definitely not trivially copyable, it's a class that has nontrivial default and copy constructors, a nontrivial copy assignment operator, and a nontrivial destructor.
I don't know in general if Eigen documents its expectations about the types that are passed as template arguments like this, examples seem to show only arithmetic types like double or at most trivially copyable classes like std::complex<float>. The segfault shown in the backtrace happens in the destructor of some mpq_class object, hard to say if it was properly constructed. |
Thank you for your reply.
Apropos of the compiler warnings, an Eigen developer just posted the Discord channel that 'eigen should use the copy/move ctor instead of memcpy if the type isn't trivially copyable' and provided the following link:
https://gitlab.com/libeigen/eigen/-/blob/master/Eigen/src/SparseCore/AmbiVector.h?ref_type=heads#L91
It therefore seems that warnings may be an Eigen issue.
The segmentation fault remains a mystery - I had previously computed examples of up to dimensions 125,125 x 125,125 without encountering the problem. It has only given the segmentation fault when I moved to the next example which happens to have dimension 175,175 x 175,175 (irrelevant numerological observation: the dimensions in these two examples happen to be divisible by 1001 = 7 * 11 * 13).
Stripping out the extraneous parts of the code, I'll construct the matrix using the same mp_class elements stored in a CSV file the and see if the segmentation fault can be replicated that way. |
|
Back to top |
|
|
amandini n00b
Joined: 01 Jul 2020 Posts: 27
|
Posted: Mon Nov 18, 2024 11:07 pm Post subject: |
|
|
bstaletic wrote: | I'm not familiar with GMP, but that operator< looks very wrong - sometimes it tells you which of the two inputs is smaller and sometimes it does the opposite.
Code: |
x y | result
---------------
0 0 | false
0 y | true
x 0 | false
x 2x | false
2x x | true
|
The last two return true if x is greater than y.
The first three return true if y is greater than x. |
Thank you for your reply.
The code containing the operator came from Eigen https://eigen.tuxfamily.org/dox/TopicCustomizing_CustomScalar.html |
|
Back to top |
|
|
GDH-gentoo Veteran
Joined: 20 Jul 2019 Posts: 1709 Location: South America
|
Posted: Tue Nov 19, 2024 9:56 pm Post subject: |
|
|
amandini wrote: | The code that produced the segmentation fault above, was
Code: | typedef Eigen::Triplet<mpq_class> qtriplet;
std::vector<qtriplet> triplets{};
|
and line 165 of the file p_basis.h referred to in the gdb backtrace aboveis
Code: | m_basis.setFromTriplets(std::cbegin(triplets), std::cend(triplets)); |
|
amandini wrote: | It has only given the segmentation fault when I moved to the next example which happens to have dimension 175,175 x 175,175 [...]. |
So, m_basis here has type Eigen::SparseMatrix<mpq_class, 0, int>, right? Do you mean that when you get the segfault, m_basis.rows() and m_basis.cols() are both 175175? How many elements does triplets actually have when m_basis.setFromTriplets() is called and segfaults? _________________
NeddySeagoon wrote: | I'm not a witch, I'm a retired electronics engineer |
Ionen wrote: | As a packager I just don't want things to get messier with weird build systems and multiple toolchains requirements though |
|
|
Back to top |
|
|
amandini n00b
Joined: 01 Jul 2020 Posts: 27
|
Posted: Wed Nov 20, 2024 7:25 am Post subject: |
|
|
GDH-gentoo wrote: | amandini wrote: | The code that produced the segmentation fault above, was
Code: | typedef Eigen::Triplet<mpq_class> qtriplet;
std::vector<qtriplet> triplets{};
|
and line 165 of the file p_basis.h referred to in the gdb backtrace aboveis
Code: | m_basis.setFromTriplets(std::cbegin(triplets), std::cend(triplets)); |
|
amandini wrote: | It has only given the segmentation fault when I moved to the next example which happens to have dimension 175,175 x 175,175 [...]. |
So, m_basis here has type Eigen::SparseMatrix<mpq_class, 0, int>, right? Do you mean that when you get the segfault, m_basis.rows() and m_basis.cols() are both 175175? How many elements does triplets actually have when m_basis.setFromTriplets() is called and segfaults? |
Thank you for your reply.
Yes, the dimensions and type you have quoted are all correct.
There are 3,010,744,671 elements in triplets when I initialise the matrix at the line Code: | m_basis.setFromTriplets(std::cbegin(triplets), std::cend(triplets)); |
|
|
Back to top |
|
|
Hu Administrator
Joined: 06 Mar 2007 Posts: 22694
|
Posted: Wed Nov 20, 2024 12:55 pm Post subject: |
|
|
The warning about memcpy is because classes with non-trivial copy constructors often (but not always) have invariants that the copy constructor will maintain, but that a straight memcpy will not. A simple example of that would be std::vector: the copy constructor is nontrivial because the copy gets its own distinct storage, with its own copy of the contained objects. Using memcpy to copy a vector would result in two vector instances that point to the same storage, and both will try to free that storage when destroyed. That free is undefined behavior and may (or may not) cause a crash due to heap corruption.
I would argue that until you can patch Eigen to build cleanly with -Werror=class-memaccess, it is not worth your time investigating the crash further. Eliminating the undefined behavior of copying the non-trivial class is worthwhile, and may (or may not) fix your crash as a side effect. |
|
Back to top |
|
|
GDH-gentoo Veteran
Joined: 20 Jul 2019 Posts: 1709 Location: South America
|
Posted: Wed Nov 20, 2024 12:55 pm Post subject: |
|
|
amandini wrote: | There are 3,010,744,671 elements in triplets when I initialise the matrix at the line Code: | m_basis.setFromTriplets(std::cbegin(triplets), std::cend(triplets)); |
|
Since what happens looks like a memory management problem, I'm not sure that the values stored in mpq_class objects matters, so the first thing I would do to try to reproduce the segmentation fault is write a small program that creates a 175,175 x 175,175 SparseMatrix<mpq_class, 0, int> object, a std::vector<Triplet<mpq_class>> object with 3,010,744,671 very simple to compute elements —but without repeating (row, column) pairs, since that executes an extra code path—, and calls setFromTriplets(). I won't have access to my Gentoo computer for several hours, but I might post one if you don't beat me to it in the meantime. _________________
NeddySeagoon wrote: | I'm not a witch, I'm a retired electronics engineer |
Ionen wrote: | As a packager I just don't want things to get messier with weird build systems and multiple toolchains requirements though |
|
|
Back to top |
|
|
GDH-gentoo Veteran
Joined: 20 Jul 2019 Posts: 1709 Location: South America
|
Posted: Wed Nov 20, 2024 11:53 pm Post subject: |
|
|
GDH-gentoo wrote: | [...] the first thing I would do to try to reproduce the segmentation fault is write a small program that creates a 175,175 x 175,175 SparseMatrix<mpq_class, 0, int> object, a std::vector<Triplet<mpq_class>> object with 3,010,744,671 very simple to compute elements —but without repeating (row, column) pairs, since that executes an extra code path—, and calls setFromTriplets() |
Here it is, based on the example in section "Filling a sparse matrix" (semitested because of memory constraints and not wanting to install Eigen) :
Code: | #include <Eigen/SparseCore>
#include <gmpxx.h>
#include <iostream>
#include <vector>
namespace Eigen {
template<> struct NumTraits<mpq_class>: GenericNumTraits<mpq_class> {
typedef mpq_class Real;
typedef mpq_class NonInteger;
typedef mpq_class Nested;
static inline Real epsilon() { return 0; }
static inline Real dummy_precision() { return 0; }
static inline int digits10() { return 0; }
enum {
IsInteger = 0,
IsSigned = 1,
IsComplex = 0,
RequireInitialization = 1,
ReadCost = 6,
AddCost = 150,
MulCost = 100
};
};
}
int main() {
using namespace Eigen;
using qtriplet = Triplet<mpq_class>;
std::vector<qtriplet> t;
t.reserve(3010744671UL);
const Index nrows = 175175UL;
unsigned long k = 0;
for (Index i = 0; i < nrows; i++)
for (Index j = 0; j < 17187UL; j++) t.push_back(qtriplet(i, j, mpq_class(k++)));
for (Index i = 0; i < 11946UL; i++) t.push_back(qtriplet(i, 17188UL, mpq_class(k++)));
std::cout << "Created vector of " << k << " qtriplets\nTrying setFromTriplets()\n";
SparseMatrix<mpq_class> m(nrows, nrows);
m.setFromTriplets(t.cbegin(), t.cend());
std::cout << "I didn't explode\n";
return 0;
} |
Does this segfault? _________________
NeddySeagoon wrote: | I'm not a witch, I'm a retired electronics engineer |
Ionen wrote: | As a packager I just don't want things to get messier with weird build systems and multiple toolchains requirements though |
|
|
Back to top |
|
|
amandini n00b
Joined: 01 Jul 2020 Posts: 27
|
Posted: Thu Nov 21, 2024 9:56 am Post subject: |
|
|
Hu wrote: | The warning about memcpy is because classes with non-trivial copy constructors often (but not always) have invariants that the copy constructor will maintain, but that a straight memcpy will not. A simple example of that would be std::vector: the copy constructor is nontrivial because the copy gets its own distinct storage, with its own copy of the contained objects. Using memcpy to copy a vector would result in two vector instances that point to the same storage, and both will try to free that storage when destroyed. That free is undefined behavior and may (or may not) cause a crash due to heap corruption.
I would argue that until you can patch Eigen to build cleanly with -Werror=class-memaccess, it is not worth your time investigating the crash further. Eliminating the undefined behavior of copying the non-trivial class is worthwhile, and may (or may not) fix your crash as a side effect. |
Thank you for the replies.
I agree that the non-trivial copy constructor issue in Eigen needs to be solved as it is possibly the root cause of the fault. The issue has now been flagged by the developers: https://gitlab.com/libeigen/eigen/-/issues/2873.
In the meantime, thinking about the question that GDH-gentoo asked about the number of non-zero elements in the matrix, I realised that because there is a equivalence relation on the entries that is used to set many of them to zero, it would be more efficient to delete these entries before rather than after the matrix is initialised.
With this change to the program, the matrix for the same example contains 1,478,894 elements and the segmentation fault no longer occurs for this particular example.
I should have done the calculation this way in the first place. |
|
Back to top |
|
|
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
|